Reducing "bridges" between circular objects from images recorded with an Intel Realsense D400 depth camera - computer-vision

I'm using an Intel Realsense D400 depth camera to record depth images of chicken eggs.
I have noticed that at the points where the eggs are close together, "bridges" are generated in the depth image. See the image below; I've circled some bridges in green.
It looks like the camera is doing interpolation of some sort.
My goal is to do segmentation on the depth image, so these bridges are a problem.
Is there a way to configure the D400 to reduce these bridges?
I am primarily looking to solve this problem by configuring the camera correctly.
Suggestions for computer vision methods are also welcome, but I have already tried different approaches and wasn't successful. (Simple thresholding won't do it. The bridges do not appear at a fixed height.)
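Not a camera-side fix, but since simple thresholding was ruled out: one common way to split touching round blobs is a distance transform followed by watershed, which tends to cut thin bridges because they receive low distance values. Below is a minimal sketch in Python/OpenCV under assumed conditions; the file name, the crude foreground test, the 0.6 peak factor, and the mask sizes are placeholders to tune on your data.

    # Sketch: split touching round blobs in a depth image with distance transform + watershed.
    # Assumes the eggs are roughly separable from the background; all thresholds are placeholders.
    import cv2
    import numpy as np

    depth = cv2.imread("depth.png", cv2.IMREAD_UNCHANGED)      # 16-bit depth frame (assumed)
    fg = (depth > 0).astype(np.uint8) * 255                    # crude foreground mask (placeholder rule)

    # Distance-transform peaks sit near egg centers; thin bridges get low values and are suppressed.
    dist = cv2.distanceTransform(fg, cv2.DIST_L2, 5)
    _, sure_fg = cv2.threshold(dist, 0.6 * dist.max(), 255, cv2.THRESH_BINARY)
    sure_fg = sure_fg.astype(np.uint8)

    # One marker per egg core; watershed grows the markers and places boundaries across the bridges.
    _, markers = cv2.connectedComponents(sure_fg)
    markers = markers + 1
    unknown = cv2.subtract(fg, sure_fg)
    markers[unknown == 255] = 0

    vis = cv2.cvtColor(cv2.convertScaleAbs(depth, alpha=255.0 / max(int(depth.max()), 1)), cv2.COLOR_GRAY2BGR)
    markers = cv2.watershed(vis, markers)                      # separating boundaries are labeled -1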

Related

stereo depth map but with a single moving camera measured with sensors

I've just started learning how to calculate depth from stereo images, and before committing to it I wanted to check whether it is a viable choice for a project I'm doing. I have a drone with a single RGB camera, plus sensors that report the drone's orientation and movement. Would it be possible to sample two frames, together with the distance and orientation difference between the samples, and use this to calculate depth? In most examples I've seen, the cameras are aligned horizontally. Is this necessary for stereo, or can I use any reasonable angle and distance between the two sampled images? Would it be feasible to do this in real time? My overall goal is some sort of monocular SLAM so the drone can navigate indoor areas. I know that ORB-SLAM exists, but I am mostly doing this as a learning experience, so I would like to do things from scratch where possible.
Thank you.
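In principle what is being described is two-view triangulation with a known relative pose. As a minimal, hedged sketch in Python/OpenCV (not a full answer): given the camera intrinsics and the rotation/translation between the two sampled frames reported by the drone's sensors, matched pixels can be triangulated directly. The intrinsics K, the pose (R, t), and the example points below are all hypothetical placeholders.

    # Sketch: triangulate depth from two frames of one moving camera with a known relative pose.
    import cv2
    import numpy as np

    K = np.array([[700.0, 0.0, 320.0],          # assumed camera intrinsics
                  [0.0, 700.0, 240.0],
                  [0.0, 0.0, 1.0]])
    R = np.eye(3)                               # rotation frame 1 -> frame 2 (from the IMU; placeholder)
    t = np.array([[0.10], [0.0], [0.0]])        # translation between frames in meters (placeholder)

    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])   # first camera at the origin
    P2 = K @ np.hstack([R, t])                          # second camera displaced by (R, t)

    # 2xN matched pixel coordinates (e.g. from feature matching); two example points only.
    pts1 = np.array([[300.0, 350.0],            # x coordinates in frame 1
                     [220.0, 260.0]])           # y coordinates in frame 1
    pts2 = np.array([[280.0, 330.0],
                     [221.0, 259.0]])

    X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)     # homogeneous 4xN result
    X = X_h[:3] / X_h[3]                                # 3D points; depth is the Z row

The two views do not have to be aligned horizontally; any reasonable angle and baseline work as long as the relative pose is known, although accuracy degrades as the baseline shrinks or the sensor-reported pose drifts.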

automatically getting edge detection for image alignment

I am trying to do image alignment like the one posted on Adrian's blog, as in this image or in this link.
I want to do image alignment on this kind of image. The problem is that I want to automatically detect the four corner points, which are hard to detect in this kind of image with contour detection as in the tutorial.
Right now I can do the alignment just fine with manually entered corner coordinates. Some of my friends suggested detecting the corners with dlib landmark detection, but as far as I can see that mostly works on shapes for which dlib automatically marks the landmarks.
Am I missing something here? Or is there any tutorial, or even a basic guide, on how to do that?
Maybe you can try to detect edges on a Gaussian pyramid. You can find an explanation here: https://en.wikipedia.org/wiki/Pyramid_(image_processing). The basic idea is that by filtering with Gaussian filters of increasing size, the small objects are blurred away. Thus, at some scale, we get only the edges of the showcase (it may need further processing).
Here is the OpenCV tutorial on image pyramids: https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_pyramids/py_pyramids.html.
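As a rough sketch of that idea in Python/OpenCV (the file name, the number of pyramid levels, and the Canny thresholds are placeholders): blur and downsample a few times so small details disappear, then detect edges at the coarse scale and look for one large four-point contour.

    # Sketch: find the large outline at a coarse pyramid level, where small details are gone.
    import cv2

    img = cv2.imread("showcase.jpg", cv2.IMREAD_GRAYSCALE)
    level = img
    for _ in range(3):                            # 3 pyrDown steps = 1/8 resolution (tune per image)
        level = cv2.pyrDown(level)

    edges = cv2.Canny(level, 50, 150)             # thresholds are placeholders
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    largest = max(contours, key=cv2.contourArea)  # assumes at least one contour was found
    quad = cv2.approxPolyDP(largest, 0.02 * cv2.arcLength(largest, True), True)
    # If len(quad) == 4, scale the corners back up by 2**3 and feed them to the
    # perspective-warp alignment you already do with manual coordinates.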
I think a wavelet pyramid (applying the wavelet transform several times) may also work for your problem, since wavelets can reduce the detail in an image.

How to turn any camera into a Depth Camera?

I want to build a depth camera, i.e., turn an ordinary camera into one that can estimate the distance to objects in an image. I have already read the following links.
http://www.i-programmer.info/news/194-kinect/7641-microsoft-research-shows-how-to-turn-any-camera-into-a-depth-camera.html
https://jahya.net/blog/how-depth-sensor-works-in-5-minutes/
But I couldn't clearly understand what hardware is required and how to integrate it all together.
Thanks
Typically, a depth sensor needs an IR sensor, as in the Kinect, the Asus Xtion, and other cameras that provide a depth or range image. However, Microsoft came up with machine learning techniques and algorithmic modifications, the research for which you can find here. Also, here is a video link showing a mobile camera that has been modified to produce depth renderings. Some hardware changes might still be necessary to turn a standalone 2D camera into such a device, so I would suggest looking at the hardware design of existing market devices as well.
One way or another, you need two viewing angles of the same points to get depth. So search for depth sensors and examples, e.g., Kinect with ROS or OpenCV, or here.
You could also turn two camera streams into a point cloud, but that's another story.
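To make the "two camera streams into a point cloud" remark a bit more concrete, here is a hedged sketch in Python/OpenCV that assumes a calibrated, rectified stereo pair; the matcher parameters and the reprojection matrix Q are placeholders you would get from your own stereo calibration.

    # Sketch: disparity from a rectified stereo pair, then reprojection to a 3D point cloud.
    import cv2
    import numpy as np

    left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
    right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

    stereo = cv2.StereoSGBM_create(minDisparity=0, numDisparities=96, blockSize=7)
    disparity = stereo.compute(left, right).astype(np.float32) / 16.0   # SGBM returns fixed-point values

    # Q is the 4x4 reprojection matrix from cv2.stereoRectify; the values here are placeholders.
    Q = np.float32([[1, 0, 0, -320],
                    [0, 1, 0, -240],
                    [0, 0, 0,  700],
                    [0, 0, -1 / 0.1, 0]])
    points_3d = cv2.reprojectImageTo3D(disparity, Q)                    # HxWx3 array of 3D points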
Here's what I know:
3D Cameras
RGB-D and stereoscopic cameras are popular for these applications but are not always practical or available. I've prototyped with Kinects (v1, v2) and Intel cameras (R200, D435). Those are certainly still the preferred option today.
2D Cameras
If you want to use RGB data for depth information, then you need an algorithm that does the math for each frame; try an RGB SLAM. A good algorithm will not reprocess all the data every frame: it processes all the data once and then looks for clues that support evidence of changes in your scene. A number of big companies have already done this (it's not that difficult if you have a big team with big money); think Google, Apple, Microsoft, etc.
Good luck out there, make something amazing!

In computer vision, what does MVS do that SFM can't?

I'm a dev with about a decade of enterprise software engineering under his belt, and my hobbyist interests have steered me into the vast and scary realm of computer vision (CV).
One thing that is not immediately clear to me is the division of labor between Structure from Motion (SFM) tools and Multi View Stereo (MVS) tools.
Specifically, CMVS appears to be the best-in-show MVS tool, and Bundler seems to be one of the better open source SFM tools out there.
Taken from CMVS's own homepage:
You should ALWAYS use CMVS after Bundler and before PMVS2
I'm wondering: why?!? My understanding of SFM tools is that they perform the 3D reconstruction for you, so why do we need MVS tools in the first place? What value/processing/features do they add that SFM tools like Bundler can't address? Why the proposed pipeline of:
Bundler -> CMVS -> PMVS2
?
Quickly put, Structure from Motion (SfM) and Multi-View Stereo (MVS) techniques are complementary, as they do not rely on the same assumptions. They also differ slightly in their inputs: MVS requires camera parameters to run, and these are estimated (output) by SfM. SfM only gives a coarse 3D output, whereas PMVS2 gives a denser output, and finally CMVS is there to circumvent some limitations of PMVS2.
The rest of the answer provides a high-level overview of how each method works and explains why it is this way.
Structure from Motion
The first step of the 3D reconstruction pipeline you highlighted is an SfM algorithm, which could be done with Bundler, VisualSFM, OpenMVG, or the like. This algorithm takes some images as input and outputs the camera parameters of each image (more on this later) as well as a coarse 3D shape of the scene, often called the sparse reconstruction.
Why does SfM output only a coarse 3D shape? Basically, SfM techniques begin by detecting 2D features in every input image and matching those features between pairs of images. The goal is, for example, to be able to say "this table corner is located at these pixel locations in these images." Those features are described by what we call descriptors (like SIFT or ORB). Descriptors are built to represent a small region (i.e., a bunch of neighboring pixels) of an image. They can reliably represent highly textured or rough geometry (e.g., edges), but these scene features need to be unique (in the sense of distinguishable) throughout the scene to be useful. For example (maybe oversimplified), a wall with a repetitive pattern would not be very useful for the reconstruction: even though it is highly textured, every region of the wall could potentially match pretty much anywhere else on the wall. Since SfM performs the 3D reconstruction using those features, the vertices of the reconstructed 3D scene will be located on those unique textures or edges, giving a coarse mesh as output. SfM won't typically produce a vertex in the middle of a surface that lacks precise, distinguishing texture. However, when many matches are found between the images, one can compute a 3D transformation between them, effectively giving the relative 3D pose of the two cameras.
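To make that front end concrete, here is a small, hedged sketch in Python/OpenCV of feature detection, matching, and relative-pose recovery between two views; the intrinsics K and the file names are assumptions, and a real SfM pipeline adds ratio tests, incremental registration, and bundle adjustment on top of this.

    # Sketch of the SfM front end: detect ORB features, match them, recover the relative pose.
    import cv2
    import numpy as np

    img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)
    K = np.array([[1000.0, 0.0, 960.0],           # assumed intrinsics
                  [0.0, 1000.0, 540.0],
                  [0.0, 0.0, 1.0]])

    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)   # relative pose; t only up to scale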
MultiView Stereo
Afterwards, an MVS algorithm is used to refine the mesh obtained by the SfM technique, resulting in what is called a dense reconstruction. This algorithm requires the camera parameters of each image to work, which are output by the SfM algorithm. Because it works on a more constrained problem (it already has the camera parameters of every image: position, rotation, focal length, etc.), MVS can compute 3D vertices in regions that were not (or could not be) correctly detected by descriptors or matched. This is what PMVS2 does.
How can PMVS work in regions where 2D feature descriptors would struggle to match? Since you know the camera parameters, you know that a given pixel in one image is the projection of a line in another image. This relationship is called epipolar geometry. Whereas SfM had to search the entire 2D image for every descriptor to find a potential match, MVS only has to search along a single 1D line, which simplifies the problem a great deal. In addition, MVS usually takes illumination and object materials into account in its optimization, which SfM does not.
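To illustrate the epipolar constraint, a tiny hedged sketch: given a fundamental matrix F relating the two views (in practice estimated, e.g., with cv2.findFundamentalMat; the identity used below is only a placeholder), a pixel in image 1 maps to a line in image 2, and the dense matcher only has to search along that line.

    # Sketch: the epipolar line in image 2 that corresponds to one pixel in image 1.
    import cv2
    import numpy as np

    F = np.eye(3)                                              # placeholder; use a real fundamental matrix
    pt1 = np.float32([[320.0, 240.0]]).reshape(-1, 1, 2)       # one pixel in image 1

    line = cv2.computeCorrespondEpilines(pt1, 1, F).reshape(3) # (a, b, c): a*x + b*y + c = 0 in image 2
    # Dense matching only searches along this 1D line instead of over the whole 2D image.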
There is one issue, though: PMVS2 performs a rather complex optimization that can be dreadfully slow or take an astronomical amount of memory on large image sequences. This is where CMVS comes into play: it clusters the coarse 3D SfM output into regions, and PMVS2 is then called (potentially in parallel) on each cluster, simplifying its execution. CMVS then merges each PMVS2 output into a unified, detailed model.
Conclusion
Most of the information provided in this answer and many more can be found in this tutorial from Yasutaka Furukawa, author of CMVS and PMVS2:
http://www.cse.wustl.edu/~furukawa/papers/fnt_mvs.pdf
In essence, the two techniques emerge from different approaches: SfM aims to perform a 3D reconstruction from a structured (but unknown) sequence of images, while MVS is a generalization of two-view stereo vision, based on human stereopsis.

Ideas: how to interactively render large image series using GPU-based direct volume rendering

I'm looking for ideas on how to convert a 30+ GB series of 2000+ color TIFF images into a dataset that can be visualized in real time (at interactive frame rates) using GPU-based volume rendering (OpenCL / OpenGL / GLSL). I want to use a direct volume visualization approach instead of surface fitting (i.e., raycasting instead of marching cubes).
The problem is two-fold. First, I need to convert my images into a 3D dataset. The first thing that came to mind is to treat all images as 2D textures and simply stack them to create a 3D texture.
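That stacking step is simple in practice. A hedged sketch with numpy and tifffile (the path pattern, the downsampling factor, and the 8-bit assumption are placeholders), producing a contiguous array that can be uploaded as a 3D texture:

    # Sketch: stack a TIFF series into one 3D volume array and write a raw block for the GPU.
    import glob
    import numpy as np
    import tifffile

    paths = sorted(glob.glob("slices/*.tif"))             # placeholder path pattern
    slices = [tifffile.imread(p) for p in paths]          # each slice: HxW, or HxWx3 for color
    volume = np.stack(slices, axis=0)                     # shape: (depth, height, width[, channels])

    coarse = volume[::4, ::4, ::4]                        # cheap low-res level for interactive preview
    # Assumes 8-bit data; rescale 16-bit data before the cast.
    volume.astype(np.uint8).tofile("volume_u8.raw")       # raw block loadable as an OpenGL/OpenCL 3D texture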
The second problem is achieving interactive frame rates. For this I will probably need some sort of downsampling in combination with "details-on-demand", loading the high-resolution dataset when zooming in, or something similar.
A first point-wise approach I found is:
polygonization of the complete volume data through layer-by-layer processing and generating corresponding image texture;
carrying out all essential transformations through vertex processor operations;
dividing polygonal slices into smaller fragments, where the corresponding depth and texture coordinates are recorded;
in fragment processing, deploying the vertex shader programming technique to enhance the rendering of fragments.
But I have no concrete ideas of how to start implementing this approach.
I would love to see some fresh ideas, or suggestions on how to start implementing the approach shown above.
If anyone has any fresh ideas in this area, they're probably going to be trying to develop and publish them. It's an ongoing area of research.
In your "point-wise approach", it seems like you have outlined the basic method of slice-based volume rendering. This can give good results, but many people are switching to a hardware raycasting method. There is an example of this in the CUDA SDK if you are interested.
A good method for hierarchical volume rendering was detailed by Crassin et al. in their paper GigaVoxels. It uses an octree-based approach, which only loads the bricks it needs into memory when they are needed.
A very good introductory book in this area is Real-Time Volume Graphics.
I've done a bit of volume rendering, though my code generated an isosurface using marching cubes and displayed that. However, in my modest self-education in volume rendering I did come across an interesting short paper: Volume Rendering on Common Computer Hardware. It comes with example source code too. I never got around to checking it out, but it seemed promising. It is in DirectX though, not OpenGL. Maybe it can give you some ideas and a place to start.