I am studying visual odometry and watched Prof. Dr. Cyrill Stachniss' video recordings, which are available as the 2015/16 YouTube playlist on Photogrammetry I & II.
First, if I want to create my own dataset (like the KITTI dataset for VO, or the Oxford campus dataset), what should the properties of the images I take with a camera be?
Are they just images? Or do they have some special properties? That is, how can I create my own dataset with a monocular or stereo camera?
Thank you.
To get the extrinsic and intrinsic parameters of the camera, you must have a set of images of a known shape taken from varying views. It's not a trivial task to do on your own, but common CV libraries/solutions have built-in utilities for camera calibration (I have dealt with the OpenCV library and the MATLAB CV package, and they are generally the same). Usually it's done with a black-and-white checkerboard or another simple geometric pattern.
Then, with known camera parameters, you can build and work with your own dataset (a minimal code sketch follows the references below).
Matlab camera calibration reference
OpenCV camera calibration tutorials
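For illustration, here is a minimal checkerboard-calibration sketch with OpenCV in C++, along the lines of those tutorials. The board dimensions, square size, and image file names are assumptions to adapt to your own setup:

#include <opencv2/calib3d.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>
#include <string>
#include <vector>

int main()
{
    const cv::Size board(9, 6);          // inner corners of the printed checkerboard (assumed)
    const float square = 0.025f;         // square size in meters (assumed)

    // Reference 3D corner positions on the board plane (z = 0).
    std::vector<cv::Point3f> corners3d;
    for (int y = 0; y < board.height; ++y)
        for (int x = 0; x < board.width; ++x)
            corners3d.emplace_back(x * square, y * square, 0.f);

    std::vector<std::vector<cv::Point3f>> object_points;
    std::vector<std::vector<cv::Point2f>> image_points;
    cv::Size image_size;

    for (int i = 0; i < 20; ++i)         // ~20 views of the board from different angles
    {
        cv::Mat img = cv::imread("calib_" + std::to_string(i) + ".png", cv::IMREAD_GRAYSCALE);
        if (img.empty()) continue;
        image_size = img.size();

        std::vector<cv::Point2f> corners2d;
        if (cv::findChessboardCorners(img, board, corners2d))
        {
            cv::cornerSubPix(img, corners2d, cv::Size(11, 11), cv::Size(-1, -1),
                             cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 30, 0.001));
            image_points.push_back(corners2d);
            object_points.push_back(corners3d);
        }
    }

    // Intrinsic matrix, distortion coefficients, and per-view extrinsics.
    cv::Mat K, dist;
    std::vector<cv::Mat> rvecs, tvecs;
    double rms = cv::calibrateCamera(object_points, image_points, image_size,
                                     K, dist, rvecs, tvecs);
    (void)rms;  // RMS reprojection error in pixels; useful as a sanity check
}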
If you want to benchmark some visual odometry algorithms with your dataset, you will definitely need the intrinsic parameters of your camera as well as its pose.
As said in #f4f's answer, the intrinsic calibration is typically done with some images of a checkerboard that you tilt and rotate (see OpenCV).
This will give you parameters such as the focal length and optical center, but also the distortion coefficients, which can be important depending on your camera.
Getting the pose of the camera (i.e. the extrinsic parameters) at each frame is probably trickier. Usually the ground truth is obtained using information from additional sensors (tracking system, IMU, GPS, ...). You can have a look at the TUM RGB-D SLAM Dataset and the corresponding paper. They explain how they used a motion-capture system to get the ground-truth poses.
Recording the acquisition time of the camera frames can also be useful (one timestamp per frame).
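As a rough sketch of the recording side (assuming OpenCV and a plain webcam; the file names and frame count are arbitrary), something like this writes one image plus one timestamp per frame:

#include <opencv2/videoio.hpp>
#include <opencv2/imgcodecs.hpp>
#include <chrono>
#include <cstdio>
#include <fstream>

int main()
{
    cv::VideoCapture cap(0);              // first attached camera (assumed)
    std::ofstream times("times.txt");     // one timestamp per line, matching the image index

    for (int i = 0; cap.isOpened() && i < 1000; ++i)
    {
        cv::Mat frame;
        if (!cap.read(frame)) break;

        // Timestamp in seconds since an arbitrary epoch, recorded at grab time.
        double t = std::chrono::duration<double>(
            std::chrono::steady_clock::now().time_since_epoch()).count();

        char name[32];
        std::snprintf(name, sizeof(name), "%06d.png", i);
        cv::imwrite(name, frame);
        times << t << "\n";
    }
}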
Creating your own visual odometry dataset is not trivial. If you just want to create a dataset "for fun" or to do some experiments, and you only have a camera available, I would say you can just try some methods that are known to work well (like ORB-SLAM). This will give you a good approximation of the camera poses (you may have to manually fix the unknown scale).
I've just started learning about calculating depth from stereo images, and before I committed to learning about this I wanted to check whether it is a viable choice for a project I'm doing. I have a drone with a single RGB camera plus sensors that can give the orientation and movement of the drone. Would it be possible to sample two frames, together with the distance and orientation difference between the samples, and use this to calculate depth?

I've seen that in most examples the cameras are lined up horizontally. Is this necessary for stereo images, or can I use any reasonable angle and distance between the two sampled images? Would it be feasible to do this in real time?

My overall goal is to do some sort of monocular SLAM to have this drone navigate indoor areas. I know that ORB-SLAM exists, but I am mostly doing this for the learning experience and so would like to do things from scratch where possible.
Thank you.
I'm currently trying to "undistort" fisheye imagery using OpenCV in C++. I know the exact lens and camera model, so I figured that I would be able to use this information to calculate some parameters and ultimately convert fisheye images to rectilinear images. However, all the tutorials I've found online encourage using auto-calibration with checkerboards. Is there a way to calibrate the fisheye camera by just using camera + lens parameters and some math? Or do I have to use the checkerboard calibration technique?
I am trying to avoid having to use the checkerboard calibration technique because I am just receiving some images to undistort, and it would be undesirable to have to ask for images of checkerboards if possible. The lens is assumed to retain a constant zoom/focal length for all images.
Thank you so much!
To undistort an image, you need to know the intrinsic parameters of the camera, which describe the distortion.
You can't compute them from datasheet values, because they depend on how the lens is manufactured: two lenses of the same vendor and model might have different distortion coefficients, especially if they are cheap ones.
Some raster graphics editors embed a lens database from which you can query distortion coefficients. But there is no magic: they built it by measuring the lens distortion, possibly interpolating between measurements afterwards.
But you can still use an empirical method to correct at least the barrel distortion.
There are plenty of shaders that do so, and you can always do your own math to build a distortion map.
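As a minimal sketch of such an empirical correction (just one way to do it, using OpenCV's standard radial model; the focal-length guess and the k1 value are hand-tuned assumptions, and a strongly fisheye lens may need the dedicated fisheye model instead):

#include <opencv2/calib3d.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>

int main()
{
    cv::Mat fisheye = cv::imread("input.jpg");

    // Hand-tuned guesses: focal length of roughly the image width, principal
    // point at the image centre, and a single radial coefficient k1 that you
    // adjust by eye until straight lines in the scene look straight.
    double f  = fisheye.cols;
    double cx = fisheye.cols / 2.0, cy = fisheye.rows / 2.0;
    double k1 = -0.3;   // a barrel-distorted lens typically has negative k1 in this model

    cv::Mat K    = (cv::Mat_<double>(3, 3) << f, 0, cx, 0, f, cy, 0, 0, 1);
    cv::Mat dist = (cv::Mat_<double>(1, 4) << k1, 0, 0, 0);

    cv::Mat rectified;
    cv::undistort(fisheye, rectified, K, dist);
    cv::imwrite("output.jpg", rectified);
}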
I'm a dev with about a decade of enterprise software engineering under my belt, and my hobbyist interests have steered me into the vast and scary realm of computer vision (CV).
One thing that is not immediately clear to me is the division of labor between Structure from Motion (SFM) tools and Multi View Stereo (MVS) tools.
Specifically, CMVS appears to be the best-in-show MVS tool, and Bundler seems to be one of the better open source SFM tools out there.
Taken from CMVS's own homepage:
You should ALWAYS use CMVS after Bundler and before PMVS2
I'm wondering: why?!? My understanding of SFM tools is that they perform the 3D reconstruction for you, so why do we need MVS tools in the first place? What value/processing/features do they add that SFM tools like Bundler can't address? Why the proposed pipeline of:
Bundler -> CMVS -> PMVS2
?
Quickly put, Structure from Motion (SfM) and MultiView Stereo (MVS) techniques are complementary, as they do not rely on the same assumptions. They also differ slightly in their inputs: MVS requires camera parameters to run, which are estimated (output) by SfM. SfM only gives a coarse 3D output, whereas PMVS2 gives a denser output, and finally CMVS is there to circumvent some limitations of PMVS2.
The rest of the answer provides a high-level overview of how each method works, explaining why it is this way.
Structure from Motion
The first step of the 3D reconstruction pipeline you highlighted is an SfM algorithm that could be done using Bundler, VisualSFM, OpenMVG or the like. This algorithm takes some images as input and outputs the camera parameters of each image (more on this later) as well as a coarse 3D shape of the scene, often called the sparse reconstruction.
Why does SfM output only a coarse 3D shape? Basically, SfM techniques begin by detecting 2D features in every input image and matching those features between pairs of images. The goal is, for example, to tell "this table corner is located at these pixel locations in these images." Those features are described by what we call descriptors (like SIFT or ORB). These descriptors are built to represent a small region (i.e. a bunch of neighboring pixels) of an image. They can reliably represent highly textured or rough geometry (e.g., edges), but these scene features need to be unique (in the sense of distinguishable) throughout the scene to be useful.

For example (maybe oversimplified), a wall with a repetitive pattern would not be very useful for the reconstruction, because even though it is highly textured, every region of the wall could potentially match pretty much everywhere else on the wall. Since SfM performs the 3D reconstruction using those features, the vertices of the 3D reconstruction will be located on those unique textures or edges, giving a coarse mesh as output. SfM won't typically produce a vertex in the middle of a surface that lacks distinctive texture. But when many matches are found between the images, one can compute a 3D transformation matrix between the images, effectively giving the relative 3D pose between the two cameras.
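To make that first step concrete, here is a hedged sketch of feature detection, matching, and relative-pose estimation with OpenCV in C++ (ORB is one descriptor choice among many; the intrinsic matrix values and file names are placeholders):

#include <opencv2/features2d.hpp>
#include <opencv2/calib3d.hpp>
#include <opencv2/imgcodecs.hpp>
#include <vector>

int main()
{
    cv::Mat img1 = cv::imread("view1.jpg", cv::IMREAD_GRAYSCALE);
    cv::Mat img2 = cv::imread("view2.jpg", cv::IMREAD_GRAYSCALE);

    // Detect ORB keypoints and compute descriptors in both images.
    auto orb = cv::ORB::create(2000);
    std::vector<cv::KeyPoint> kp1, kp2;
    cv::Mat des1, des2;
    orb->detectAndCompute(img1, cv::noArray(), kp1, des1);
    orb->detectAndCompute(img2, cv::noArray(), kp2, des2);

    // Match descriptors between the two images (Hamming distance for ORB).
    cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/true);
    std::vector<cv::DMatch> matches;
    matcher.match(des1, des2, matches);

    // Collect the matched pixel coordinates.
    std::vector<cv::Point2f> pts1, pts2;
    for (const auto &m : matches) {
        pts1.push_back(kp1[m.queryIdx].pt);
        pts2.push_back(kp2[m.trainIdx].pt);
    }

    // With a known intrinsic matrix K (placeholder values here), estimate the
    // essential matrix and recover the relative rotation/translation.
    cv::Mat K = (cv::Mat_<double>(3, 3) << 700, 0, 320, 0, 700, 240, 0, 0, 1);
    cv::Mat R, t;
    cv::Mat E = cv::findEssentialMat(pts1, pts2, K, cv::RANSAC);
    cv::recoverPose(E, pts1, pts2, K, R, t);   // t is only known up to scale
}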
MultiView Stereo
Afterwards, the MVS algorithm is used to refine the mesh obtained by the SfM technique, resulting in what is called a dense reconstruction. This algorithm requires the camera parameters of each image to work, which are output by the SfM algorithm. As it works on a more constrained problem (it already has the camera parameters of every image: position, rotation, focal length, etc.), MVS will compute 3D vertices in regions which were not (or could not be) correctly detected by descriptors or matched. This is what PMVS2 does.
How can PMVS work in regions where 2D feature descriptors would have difficulty matching? Since you know the camera parameters, you know that a given pixel in one image corresponds to a line in another image. This relationship is called epipolar geometry. Whereas SfM had to search the entire 2D image for every descriptor to find a potential match, MVS only has to search along a single 1D line, which simplifies the problem considerably. As such, MVS usually takes illumination and object materials into account in its optimization, which SfM does not.
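As a small illustration of that epipolar constraint with OpenCV (the intrinsics and the essential matrix below are placeholders standing in for the real SfM output):

#include <opencv2/calib3d.hpp>
#include <vector>

int main()
{
    // Placeholder camera intrinsics and essential matrix; in practice these
    // come from the SfM stage (see the sketch above).
    cv::Mat K = (cv::Mat_<double>(3, 3) << 700, 0, 320, 0, 700, 240, 0, 0, 1);
    cv::Mat E = cv::Mat::eye(3, 3, CV_64F);   // stand-in for the real essential matrix

    // Fundamental matrix from the essential matrix: F = K^-T * E * K^-1.
    cv::Mat F = K.inv().t() * E * K.inv();

    // A pixel in image 1 maps to an epipolar line a*x + b*y + c = 0 in image 2,
    // so a dense matcher only has to search along that 1D line.
    std::vector<cv::Point2f> pixels = { cv::Point2f(412.f, 300.f) };
    std::vector<cv::Vec3f> lines;
    cv::computeCorrespondEpilines(pixels, 1, F, lines);
}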
There is one issue, though: PMVS2 performs a quite complex optimization that can be dreadfully slow or take an astronomical amount of memory on large image sequences. This is where CMVS comes into play, clustering the coarse 3D SfM output into regions. PMVS2 is then called (potentially in parallel) on each cluster, simplifying its execution. CMVS finally merges each PMVS2 output into a unified, detailed model.
Conclusion
Most of the information provided in this answer, and much more, can be found in this tutorial by Yasutaka Furukawa, author of CMVS and PMVS2:
http://www.cse.wustl.edu/~furukawa/papers/fnt_mvs.pdf
In essence, both techniques emerge from two different approaches: SfM aims to perform a 3D reconstruction from a structured (but unknown) sequence of images, while MVS is a generalization of two-view stereo vision, based on human stereopsis.
We have modified sample code for the C API so Tango pose data (position (x,y,z) and quaternion (x,y,z,w)) is published as PoseStamped ROS messages.
We are attempting to visualize the pose using Rviz. The pose data appears to need some transformation as the rotation of the Rviz arrow does not match the behavior of the Tango when we move it around.
We realize that in the sample code, before visualization on the Tango screen, the pose data is transformed into a 4x4 Pose matrix (function PoseData::GetExtrinsicsAppliedOpenGLWorldFrame), which is then multiplied left and right by various matrices representing changes of coordinate frames (for instance, Tango to OpenGL).
Ideally, we would be able to apply a similar transformation to the pose data before publishing it for visualization. However, we must keep the pose data in the position (x,y,z) and quaternion (x,y,z,w) format in order to publish it in a PoseStamped message, and we do not see which transform to apply.
We have looked at the Tango coordinate systems conventions but the transformations the Tango developers suggest we apply are only suited for pose data in a Pose matrix format. We have also attempted to apply transformations applied by Ologic in their code to no avail.
Does anyone have any suggestions on how to transform Tango pose data, without changing its format, for correct visualization on the Rviz OpenGL interface?
If it's the OpenGL convention, you will basically need to apply a transformation on the left-hand side of the pose data. The C++ motion tracking example has a line doing this operation here. You could ignore the rotation part and just apply the following code:
glm::mat4 opengl_world_T_opengl_camera = tango_gl::conversions::opengl_world_T_tango_world() * start_service_T_device;
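If you then need to get back to the position + quaternion form for the PoseStamped message, a small helper along these lines (using glm, which the Tango samples already pull in; the function name is just an illustration) should work:

#include <glm/glm.hpp>
#include <glm/gtc/quaternion.hpp>

// Hypothetical helper: split a 4x4 pose matrix (e.g. the product computed
// above) back into the position + quaternion form used by PoseStamped.
void matrixToPose(const glm::mat4 &pose, glm::vec3 &position, glm::quat &orientation)
{
    position    = glm::vec3(pose[3]);     // translation lives in the last column
    orientation = glm::quat_cast(pose);   // rotation part as a quaternion
}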
I know this is a late answer, but it may help other people.
If you want to visualize any data with Rviz, I assume that you want to use ROS. Then maybe the best way to do it is to use the rosjava library for your Tango Android app. It works well for me. You just have to use PoseStamped, Odometry, and tf publishers on your Tango device and then display the topics with Rviz. Moreover, it is one of the best ways to keep the real-time aspect.
Here are two good resources to learn how to use rosjava:
https://github.com/ologic/Tango
https://github.com/rosjava/android_core/tree/master
I'm studying the use of multiple cameras for computer vision applications. E.g. there is a camera in every corner of the room and the task is human tracking. I would like to simulate this kind of environment. What I need is:
Ability to define dynamic 3D environment, e.g. room and a moving object.
Options to place cameras at different positions and get simulated data set for each camera.
Does anyone have any experience with that? I checked out Blender (http://www.blender.org), but currently I'm looking for a faster / easier-to-use solution.
Could you point me to suitable software/libraries (preferably C++ or MATLAB)?
You may find that ILNumerics perfectly fits your needs:
http://ilnumerics.net
If I get it right, you are looking to simulate camera feeds from multiple cameras at different positions in an environment.
I don't know of any site or ready-made solution, but here is how I would proceed:
Procure 3D point clouds of a dynamic environment (see the Kinect 3D SLAM benchmark datasets) or generate one of your own with a Kinect (hoping you have an Xbox Kinect with you).
Once you have the Kinect point clouds in the PCL point cloud format, you can simulate video feeds from various cameras.
Code along these lines will suffice (sketched here with PCL in C++):
#include <pcl/io/pcd_io.h>
#include <pcl/point_types.h>
#include <pcl/common/transforms.h>
#include <opencv2/imgcodecs.hpp>
#include <string>
#include <vector>

// makeImage(): discard the 3D depth information and fill an image with the
// RGB values, like a snapshot in the pcd_viewer of pcl (point cloud library).
// A possible implementation is sketched at the end of this answer.
void makeImage(const pcl::PointCloud<pcl::PointXYZRGB> &cloud, cv::Mat &image);

int main()
{
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr pcd(new pcl::PointCloud<pcl::PointXYZRGB>);
    pcl::io::loadPCDFile("scene.pcd", *pcd);                 // read the point cloud

    // One affine transform per simulated camera position.
    std::vector<Eigen::Affine3f> camera_positions = { Eigen::Affine3f::Identity() };

    int i = 0;
    for (const auto &camera_position : camera_positions)
    {
        pcl::PointCloud<pcl::PointXYZRGB> cloud_out;
        pcl::transformPointCloud(*pcd, cloud_out, camera_position);
        // cloud_out now contains the point cloud seen from this viewpoint.
        cv::Mat image;
        makeImage(cloud_out, image);
        cv::imwrite("view_" + std::to_string(i++) + ".png", image);
    }
}
PCL provides a function to transform a point cloud given the appropriate parameters: pcl::transformPointCloud().
If you do not wish to use PCL, you may want to check this post and then follow the remaining steps.
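For completeness, here is one possible (simplified) implementation of the makeImage() stub above, assuming a pinhole model with made-up intrinsics; a real version would also need a z-buffer so that the nearest point wins when several points project to the same pixel:

#include <opencv2/core.hpp>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

void makeImage(const pcl::PointCloud<pcl::PointXYZRGB> &cloud, cv::Mat &image)
{
    const int width = 640, height = 480;                      // assumed image size
    const double fx = 525.0, fy = 525.0;                      // assumed focal lengths (pixels)
    const double cx = width / 2.0, cy = height / 2.0;         // principal point at the centre
    image = cv::Mat::zeros(height, width, CV_8UC3);

    for (const auto &p : cloud)
    {
        if (p.z <= 0) continue;                               // point is behind the camera
        int u = static_cast<int>(fx * p.x / p.z + cx);
        int v = static_cast<int>(fy * p.y / p.z + cy);
        if (u < 0 || u >= width || v < 0 || v >= height) continue;
        image.at<cv::Vec3b>(v, u) = cv::Vec3b(p.b, p.g, p.r); // OpenCV uses BGR order
    }
}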