In computer vision, what does MVS do that SFM can't?

In computer vision, what does MVS do that SFM can't? - computer-vision

I'm a dev with about a decade of enterprise software engineering under his belt, and my hobbyist interests have steered me into the vast and scary realm of computer vision (CV).
One thing that is not immediately clear to me is the division of labor between Structure from Motion (SFM) tools and Multi View Stereo (MVS) tools.
Specifically, CMVS appears to be the best-in-show MVS tool, and Bundler seems to be one of the better open source SFM tools out there.
Taken from CMVS's own homepage:
You should ALWAYS use CMVS after Bundler and before PMVS2
I'm wondering: why?!? My understanding of SFM tools is that they perform the 3D reconstruction for you, so why do we need MVS tools in the first place? What value/processing/features do they add that SFM tools like Bundler can't address? Why the proposed pipeline of:
Bundler -> CMVS -> PMVS2
?

Quickly put, Structure from Motion (SfM) and MultiView Stereo (MVS) techniques are complementary, as they do not deal with the same assumptions. They also differ slightly in their inputs, MVS requiring camera parameters to run, which is estimated (output) by SfM. SfM only gives a coarse 3D output, whereas PMVS2 gives a more dense output, and finally CMVS is there to circumvent some limitations of PMVS2.
The rest of the answer provides an high-level overview of how each method works, explaining why it is this way.
Structure from Motion
The first step of the 3D reconstruction pipeline you highlighted is a SfM algorithm that could be done using Bundler, VisualSFM, OpenMVG or the like. This algorithm takes in input some images and outputs the camera parameters of each image (more on this later) as well as a coarse 3D shape of the scene, often called the sparse reconstruction.
Why does SfM outputs only a coarse 3D shape? Basically, SfM techniques begins by detecting 2D features in every input image and matching those features between pairs of images. The goal is, for example, to tell "this table corner is located at those pixels locations in those images." Those features are described by what we call descriptors (like SIFT or ORB). Those descriptors are built to represent a small region (ie. a bunch of neighboring pixels) in images. They can represent reliably highly textured or rough geometries (e.g., edges), but these scene features need to be unique (in the sense distinguishing) throughout the scene to be useful. For example (maybe oversimplified), a wall with repetitive patterns would not be very useful for the reconstruction, because even though it is highly textured, every region of the wall could potentially match pretty much everywhere else on the wall. Since SfM is performing a 3D reconstruction using those features, the vertices of the 3D scene reconstruction will be located on those unique textures or edges, giving a coarse mesh as output. SfM won't typically produce a vertex in the middle of surface without precise and distinguishing texture. But, when many matches are found between the images, one can compute a 3D transformation matrix between the images, effectively giving the relative 3D position between the two camera poses.
MultiView Stereo
Afterwards, the MVS algorithm is used to refine the mesh obtained by the SfM technique, resulting in what is called a dense reconstruction. This algorithm requires the camera parameters of each image to work, which is output by the SfM algorithm. As it works on a more constrained problem (since they already have the camera parameters of every image like position, rotation, focal, etc.), MVS will compute 3D vertices on regions which were not (or could not be) correctly detected by descriptors or matched. This is what PMVS2 does.
How can PMVS work on regions where 2D feature descriptor would difficultly match? Since you know the camera parameters, you know a given pixel in an image is the projection of a line in another image. This approach is called epipolar geometry. Whereas SfM had to seek through the entire 2D image for every descriptor to find a potential match, MVS will work on a single 1D line to find matches, simplifying the problem quite a deal. As such, MVS usually takes into account illumination and object materials into its optimization, which SfM does not.
There is one issue, though: PMVS2 performs a quite complex optimization that can be dreadfully slow or take an astronomic amount of memory on large image sequences. This is where CMVS comes into play, clustering the coarse 3D SfM output into regions. PMVS2 will then be called (potentially in parallel) on each cluster, simplifying its execution. CMVS will then merge each PMVS2 output in an unified detailed model.
Conclusion
Most of the information provided in this answer and many more can be found in this tutorial from Yasutaka Furukawa, author of CMVS and PMVS2:
http://www.cse.wustl.edu/~furukawa/papers/fnt_mvs.pdf
In essence, both techniques emerge from two different approaches: SfM aims to perform a 3D reconstruction using a structured (but theunknown) sequence of images while MVS is a generalization of the two-view stereo vision, based on human stereopsis.

Related

Visual Odometry, Camera Parameters

I am studying about visual odometry and watched Prof. Dr. Cyrill Stachniss' video recordings which are available as YouTube 2015/16 Playlist about Photogrammetry I & II .
First, If I want to create my own dataset (like KITTI dataset for VO or like Oxford campus dataset) what should be the properties of the image that I take with a camera.
Are they just images? Or, does they have some special properties ? That is, how can I create my own dataset with a monocular or stereo camera.
Thank you.

To get extrinsic and intrinsic parameters from the image you must have a set of images of known shape from varying views. It's not trivial task to do on your own, by common CV libraries / solution have a built-in utilities for camera calibration (I have to deal with OpenCV library and Matlab CV package and they are generally the same).Usually it's done with a black and white checkboard or another simple geometric pattern.
Then with known camera parameters you can manipulate your own dataset.
Matlab camera calibration reference
OpenCV camera calibration tutorials

If you want to benchmark some visual odometry algorithms with your dataset, you will definitely need the intrinsic parameters of your camera as well as its pose.
As said in #f4f answer, the intrinsic calibration is typically done with some images of a checkerboard that you tilt and rotate (see opencv).
This will give you parameters such as focal length, optical center but also the distortion coefficients which can be important depending on your camera.
Getting the pose of the camera (i.e extrinsic parameters) at each frame is probably trickier. Usually the ground-truth is obtained using information from additional sensors (tracking system, IMU, GPS, ...). You can have a look at : TUM RGB-D SLAM Dataset and the corresponding paper. They explain how they used a motion-capture system to get the ground-truth pose.
Recording the time of acquisition of the camera frames can also be interesting (one timestamp per frame).
Creating your own visual odometry dataset is not trivial. If you just want to create a dataset "for fun" or to do some experiments and if you have only a camera available, I would say you can just try some methods that are known to work well (like ORB-SLAM). This will give you good approximate of the camera poses (you may have to manually fix the unknown scale).

random voxel terrain on the gpu (dc vs. cms)

I want to create a random fractal terrain on the gpu (with a compute shader). I've started with implementing marching cubes: Generating Complex Procedural Terrains Using the GPU, and it works really good: marching cubes on the gpu. However, marching cubes can't extract sharp features or use an adaptive resolution.
So I looked for an advanced isosurface extraction algorithm and found Dual Contouring of Hermite Data. So I implemented dc in Java to test this algorithm and it looks great: dc on the cpu. (There are some holes in the mesh and no sharp features, but I was too lazy to implement/fix this because it's only a prototype.)
But I've noticed some negative aspects:
Intercell-dependent. (I have no idea how to port this to the gpu: the only resource I've found is Dual Contouring with OpenCL.)
I don't know how to create a chunk system, because there are no "clear" borders https://i.stack.imgur.com/62dy6.png.
So I continued my search for a better algorithm and found Cubical Marching Squares: Adaptive Feature Preserving Surface Extraction from Volume Data. This seems to be the perfect algorithm for me: intercell-independent, adaptive, sharp features, primal structure and even manifold. Unfortunately, there are no resources on how to implement this algorithm except for this Cubical Marching Squares Implementation. I think I understand the two parts of the algorithm: create a grid, for each cell:
Subdivide until the maximum depth is reached or there is no need to do so.
Split each cell into 6 faces, extract their surface and stitch them together.
But I don't know how to connect those two parts (especially the part with transitional faces, page 38).
So does anybody know how to implement dc as a shader, how to implement cms or a better algorithm (maybe dual marching cubes, I think it has the same problem as dc, but I haven't tested it yet)?

As you have already mentioned for the GPU you would need no intercell-dependencies... I'm sure people have come up with workarounds for that on DC but CMS should be one of the best for all those things that you need, namely intercell-independent (by definition), preserves sharp features, creates manifold geometry, and supports adaptive resolution.
In terms of resources on CMS I agree they are quite limited.
The original paper: http://graphics.csie.ntu.edu.tw/CMS/
This project by Matt Keeter has a 'C' CMS implementation in it (https://www.mattkeeter.com/projects/kokopelli/)
https://github.com/mkeeter/kokopelli <-- code
I used Matt Keeter's implementation as a reference for my partial 'C++' implementation (the thesis that you linked), here is the code for it in case you didn't find it:
https://bitbucket.org/GRassovsky/cubical-marching-squares
However keep in mind that it is indeed partial, that is to say it has the main algorithm working (adaptive and manifold), but I haven't got around to implement the sharp-feature preservation, 2D and 3D disambiguation, etc. It is also currently a basic CPU implementation... I had the good intentions to implement all those things and make a GPU implementation but for now have had no time.
"But I don't know how to connect those two parts" - that only goes to say how badly I have written my thesis, because I try to explain how I do this :D (Mind you, I am not sure exactly how this was done in the original paper... should we say that some things are not very clear there and you would have to use your imagination :))

Neural network topology for object recognition on aerial photos (computer vision)

My objective is to recognize the footprints of buildings on aerial photos. Having heard about recent progress in machine vision (ImageNet Large Scale Visual Recognition Challenges) I though I could (at least) try to use neural networks for this task.
Can anybody give me the idea what should be the topology of such a network? I guess it should have as many outputs as inputs (which means all the pixels in picture) since I want to recognize the outlines of buildings with their (at least approximate) placement on the picture.
I guess the input pictures should be of standard size, with each pixel normalized to grey scale or YUV color space (1 value per color) and maybe normalized resolution (each pixel should represent fixed size in reality). I am not sure if the picture could be preprocessed in any other way before inputting into net, maybe by extracting the edges first?
The tricky part is how the outputs should be represented and how to train the net. Using just e.g. output=0 for the pixel within building footprint and 1 for the pixel outside of it, might not be the best idea. Maybe I should teach the network to recognize edges of the building instead so the pixels which represent building edges should have 1's and 0's for the rest of pixels?
Can anybody throw in some suggestions about network topology/inputs/outputs formats?
Or maybe this task is hopelessly difficult and I have 0 chances to solve it?

I think we need a better definition of "buildings". If you want to do building "detection", that is detect the presence of a building of any shape/size, this is difficult for a cascade classifier. You can try the following, though:
Partition a set of known images to fixed-size blocks.
Label each block as "building", "not building", or
"boundary(includes portions
of both)"
Extract basic features like intensity histograms, edges,
hough lines, HOG, etc.
Train SVM classifiers based on these features (you can try others, too, but I recommend SVM by experience).
Now you can partition your images again and use the trained classifier to get the results. The results will have to be combined to identify buildings.
This will still need some testing to get the parameters(size of histograms, parameters of SVM classifier etc.) right.
I have used this approach to detect "food" regions on images. The accuracy was below 70%, but my guess is that it will be better for buildings.

Generate an image that can be most easily detected by Computer Vision algorithms

Working on a small side project related to Computer Vision, mostly to try playing around with OpenCV. It lead me to an interesting question:
Using feature detection to find known objects in an image isn't always easy- objects are hard to find, especially if the features of the target object aren't great.
But if I could choose ahead of time what it is I'm looking for, then in theory I could generate for myself an optimal image for detection. Any quality that makes feature detection hard would be absent, and all the qualities that make it easy would exist.
I suspect this sort of thought went into things like QR codes, but with the limitations that they wanted QR codes to be simple, and small.
So my question for you: How would you generate an optimal image for later recognition by a camera? What if you already know that certain problems like skew, or partial obscuring would occur?
Thanks very much

I think you need something like AR markers.
Take a look at ArToolkit, ArToolkitPlus or Aruco libraries, they have marker generators and detectors.
And papeer about marker generation: http://www.uco.es/investiga/grupos/ava/sites/default/files/GarridoJurado2014.pdf

If you plan to use feature detection, than marker should be specific to used feature detector. Common practice for detector design is good response to "corners" or regions with high x,y gradients. Also you should note the scaling of target.
The simplest detection can be performed with BLOBS. It can be faster and more robust than feature points. For example you can detect circular blobs or rectangular.

Depending on the distance you want to see your markers from and viewing conditions/backgrounds you typically use and camera resolution/noise you should choose different images/targets. Under moderate perspective from a longer distance a color target is pretty unique, see this:
https://surf-it.soe.ucsc.edu/sites/default/files/velado_report.pdf
at close distances various bar/QR codes may be a good choice. Other than that any flat textured object will be easy to track using homography as opposed to 3D objects.
http://docs.opencv.org/trunk/doc/py_tutorials/py_feature2d/py_feature_homography/py_feature_homography.html
Even different views of 3d objects can be quickly learned and tracked by such systems as Predator:
https://www.youtube.com/watch?v=1GhNXHCQGsM
then comes the whole field of hardware, structured light, synchronized markers, etc, etc. Kinect, for example, uses a predefined pattern projected on the surface to do stereo. This means it recognizes and matches million of micro patterns per second creating a depth map from the matched correspondences. Note that one camera sees the pattern and while another device - a projector generates it working as a virtual camera, see
http://article.wn.com/view/2013/11/17/Apple_to_buy_PrimeSense_technology_from_the_360s_Kinect/
The quickest way to demonstrate good tracking of a standard checkerboard pattern is to use pNp function of open cv:
http://www.juergenwiki.de/work/wiki/lib/exe/fetch.php?media=public:cameracalibration_detecting_fieldcorners_of_a_chessboard.gif
this literally can be done by calling just two functions
found = findChessboardCorners(src, chessboardSize, corners, camFlags);
drawChessCornersDots(dst, chessboardSize, corners, found);
To sum up, your question is very broad and there are multiple answers and solutions. Formulate your viewing condition, camera specs, backgrounds, distances, amount of motion and perspective you expect to have indoors vs outdoors, etc. There is no such a thing as a general average case in computer vision!

IDEAs: how to interactively render large image series using GPU-based direct volume rendering

I'm looking for idea's how to convert a 30+gb, 2000+ colored TIFF image series into a dataset able to be visualized in realtime (interactive frame rates) using GPU-based volume rendering (using OpenCL / OpenGL / GLSL). I want to use a direct volume visualization approach instead of surface fitting (i.e. raycasting instead of marching cubes).
The problem is two-fold, first I need to convert my images into a 3D dataset. The first thing which came into my mind is to see all images as 2D textures and simply stack them to create a 3D texture.
The second problem is the interactive frame rates. For this I will probably need some sort of downsampling in combination with "details-on-demand" loading the high-res dataset when zooming or something.
A first point-wise approach i found is:
polygonization of the complete volume data through layer-by-layer processing and generating corresponding image texture;
carrying out all essential transformations through vertex processor operations;
dividing polygonal slices into smaller fragments, where the corresponding depth and texture coordinates are recorded;
in fragment processing, deploying the vertex shader programming technique to enhance the rendering of fragments.
But I have no concrete ideas of how to start implementing this approach.
I would love to see some fresh ideas or ideas on how to start implementing the approach shown above.

If anyone has any fresh ideas in this area, they're probably going to be trying to develop and publish them. It's an ongoing area of research.
In your "point-wise approach", it seems like you have outlined the basic method of slice-based volume rendering. This can give good results, but many people are switching to a hardware raycasting method. There is an example of this in the CUDA SDK if you are interested.
A good method for hierarchical volume rendering was detailed by Crassin et al. in their paper called Gigavoxels. It uses an octree-based approach, which only loads bricks needed in memory when they are needed.
A very good introductory book in this area is Real-Time Volume Graphics.

I've done a bit of volume rendering, though my code generated an isosurface using marching cubes and displayed that. However, in my modest self-education of volume rendering I did come across an interesting short paper: Volume Rendering on Common Computer Hardware. It comes with source example too. I never got around to checking it out but it seemed promising. It is it DirectX though, not OpenGL. Maybe it can give you some ideas and a place to start.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js