Efficient Stereo Matching Algorithm in Windows - C++

I want to compute a disparity map on the Windows platform. I have tried several codes from the internet, but I can't compute a precise disparity map. I have used the OpenCV SGBM algorithm, but the resulting disparity map was very noisy. Could anyone suggest an efficient implementation?
Thanks in advance.

Well, stereo vision is always a trade-off between speed and accuracy; you can't have both. SGM or SGBM already strike a pretty good compromise here. If you want very high precision you can move towards true global algorithms such as belief propagation, but expect several minutes of computation time for a single camera frame.
If you want fast processing, you can use block matching. But there you will get poor performance on low-texture surfaces and artifacts like foreground fattening.
The only way to speed things up further is to switch to hardware that is capable of massively parallel processing. There are some great FPGA-based solutions for stereo vision, like this one:
https://nerian.com/products/sp1-stereo-vision/
Using a (high-end) graphics card would also be an option. There are several research papers with proposed implementations, and there might be some code available somewhere.
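Since the complaint here is noise rather than speed, it may also be worth tuning SGBM's smoothness and speckle parameters before switching algorithms. Below is a minimal, untested sketch using OpenCV's StereoSGBM with illustrative parameter values and a speckle-filtering post-pass; adjust the numbers for your own cameras:

```cpp
// A minimal sketch (untested): tuning OpenCV's StereoSGBM and cleaning up
// the raw disparity with speckle filtering. Parameter values are illustrative.
#include <opencv2/opencv.hpp>

cv::Mat computeDisparity(const cv::Mat& leftGray, const cv::Mat& rightGray)
{
    const int numDisparities = 128;              // must be divisible by 16
    const int blockSize      = 5;                // odd, typically 3..11
    const int channels       = leftGray.channels();

    cv::Ptr<cv::StereoSGBM> sgbm = cv::StereoSGBM::create(
        /*minDisparity*/      0,
        /*numDisparities*/    numDisparities,
        /*blockSize*/         blockSize,
        /*P1*/                8  * channels * blockSize * blockSize,  // smoothness penalties
        /*P2*/                32 * channels * blockSize * blockSize,
        /*disp12MaxDiff*/     1,
        /*preFilterCap*/      63,
        /*uniquenessRatio*/   10,
        /*speckleWindowSize*/ 100,
        /*speckleRange*/      2,
        /*mode*/              cv::StereoSGBM::MODE_SGBM_3WAY);        // faster variant

    cv::Mat disp16;                               // fixed-point, 4 fractional bits
    sgbm->compute(leftGray, rightGray, disp16);

    // Remove small isolated blobs that survive the built-in speckle filter.
    cv::filterSpeckles(disp16, /*newVal*/ 0, /*maxSpeckleSize*/ 400, /*maxDiff*/ 32);

    cv::Mat disp8;                                // scale to 8-bit for display
    disp16.convertTo(disp8, CV_8U, 255.0 / (numDisparities * 16.0));
    return disp8;
}
```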

Related

Can CNNs be faster than classic descriptors?

Disclaimer: I know almost nothing about CNNs, and I have no idea where else I could ask this.
My research is focused on high-performance computer vision applications. We generate codes representing an image in less than 20 ms, on images whose largest dimension is 500 px.
This is done by combining SURF descriptors and VLAD codes, obtaining a vector representing an image that will be used in our object recognition application.
Can CNNs be faster? According to this benchmark (which is based on much smaller images), the time needed is longer, almost double, even though the image size is half of ours.
Yes, they can be faster. The numbers you got are for networks trained for ImageNet classification: 1 million images, 1000 classes. Unless your classification problem is similarly hard, using an ImageNet network is overkill.
You should also remember that these networks have on the order of 10-100 million weights, so they are quite expensive to evaluate. But you probably don't need a really big network; you can design your own network, with fewer layers and parameters, that is much cheaper to evaluate.
In my experience: I designed a network to classify 96x96 sonar image patches, and with around 4000 weights in total it gets over 95% classification accuracy and runs at 40 ms per frame on an RPi2.
A bigger network with 900K weights, same input size, takes 7 ms to evaluate on a Core i7. So this is surely possible; you just need to play with smaller network architectures. A good start is SqueezeNet, a network that achieves good performance on ImageNet but has 50 times fewer weights, and is of course much faster than other networks.
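As a concrete illustration of the "design a smaller network" advice (and explicitly not the sonar network described above), a hypothetical architecture in that weight range could be sketched with LibTorch, the PyTorch C++ frontend; all layer sizes below are invented:

```cpp
// A hypothetical tiny classifier for 96x96 single-channel patches, sketched
// with LibTorch (the PyTorch C++ frontend). Layer sizes are made up and land
// at roughly 3.6k parameters for 10 classes - small enough to be cheap to
// evaluate, in the spirit of the answer above.
#include <torch/torch.h>

struct TinyNet : torch::nn::Module {
    explicit TinyNet(int numClasses)
        : conv1(register_module("conv1",
              torch::nn::Conv2d(torch::nn::Conv2dOptions(1, 8, 5).stride(2)))),
          conv2(register_module("conv2",
              torch::nn::Conv2d(torch::nn::Conv2dOptions(8, 16, 5).stride(2)))),
          fc(register_module("fc", torch::nn::Linear(16, numClasses))) {}

    torch::Tensor forward(torch::Tensor x) {
        x = torch::relu(conv1->forward(x));        // 1x96x96 -> 8x46x46
        x = torch::relu(conv2->forward(x));        // 8x46x46 -> 16x21x21
        x = torch::adaptive_avg_pool2d(x, {1, 1}); // global average pooling
        x = x.view({x.size(0), 16});
        return fc->forward(x);                     // raw class scores
    }

    torch::nn::Conv2d conv1, conv2;
    torch::nn::Linear fc;
};
```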
I would be wary of benchmarks and blanket statements. It's important to know every detail that went into generating the quoted values. For example, would running CNN on GPU hardware improve the quoted values?
20ms seems very fast to me; so does 40ms. I have no idea what your requirement is.
What other benefits could CNN offer? Maybe it's more than just raw speed.
I don't believe that neural networks are the perfect technique for every problem. Regression, SVM, and other classification techniques are still viable.
There's a bias at work here. Your question reads as if you are looking only to confirm that your current research is best. You have a sunk cost that you're loath to throw away, but you're worried that there might be something better out there. If that's true, I don't think this is a good question for SO.
"I don't know almost nothing on CNNs" - if you're a true researcher, seeking the truth, I think you have an obligation to learn and answer for yourself. TensorFlow and Keras make this easy to do.
Answer to your question: Yes, they can. They can be slower and they can be faster than classic descriptors. For example, using only a single filter and several max-poolings will almost certainly be faster. But the results will also certainly be crappy.
You should ask a much more specific question. Relevant parts are:
Problem: Classification / Detection / Semantic Segmentation / Instance Segmentation / Face verification / ... ?
Constraints: Minimum accuracy / maximum speed / maximum latency?
Evaluation specifics:
Which hardware is available (GPUs)?
Do you evaluate on a single image? Often you can evaluate up to 512 images in about the same time as one image.
Also: The input image size should not be relevant. If CNNs achieve better results on smaller inputs than classic descriptors, why should you care?
Papers
Please note that CNNs are usually not tweaked towards speed, but towards accuracy.
Detection: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks: 600px x ~800px in 200ms on a GPU
InverseFaceNet: Deep Single-Shot Inverse Face Rendering From A Single Image: 9.79ms with GeForce GTX Titan and AlexNet to get FC7 features
Semantic segmentation: Pixel-wise Segmentation of Street with Neural Networks: 20ms with a GeForce GTX 980

Improved SGBM based on previous frames result

I was wondering whether there is any good method to make the SGBM process faster by using information from the previous video frame.
I think it can be made faster by searching for correspondences only near the disparity found in the previous frame. The problem I see with this is when, from one frame to the next, a block passes from an object to the background or vice versa. If it is possible, I think it would be an interesting improvement; I have looked for it, but I didn't find anything.
You have already identified the problem yourself: it breaks down when the scene is in motion.
I managed to write some algorithms that take into consideration the critical zones around object borders; they were a little more accurate, but much slower than SGBM.
Maybe you can simply set the maximum and minimum disparity values to a reasonable range around what you find in the previous frame, instead of using "safe values". In my experience with OpenCV, StereoBM is faster but not as good as SGBM, and SGBM is better optimized than anything you would write yourself (again, in my experience).
Maybe you can get better (faster) results using the CUDA algorithm (SGBM processed on the GPU). My group and I are working on that.
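A minimal, untested sketch of the range-clamping suggestion above, using OpenCV's StereoSGBM. The margin and block size are illustrative, and the scaling assumes SGBM's usual 16x fixed-point disparity output:

```cpp
// A minimal sketch (untested): derive the disparity search window for the
// current frame from the previous frame's result, instead of fixed "safe"
// values. Parameter values are illustrative.
#include <algorithm>
#include <opencv2/opencv.hpp>

cv::Ptr<cv::StereoSGBM> makeMatcherFromPreviousDisparity(const cv::Mat& prevDisp16)
{
    // StereoSGBM stores disparities as 16*d; invalid pixels are negative.
    cv::Mat valid = prevDisp16 > 0;
    double minD16 = 0.0, maxD16 = 0.0;
    cv::minMaxLoc(prevDisp16, &minD16, &maxD16, nullptr, nullptr, valid);

    const int margin = 16;                               // slack for scene motion
    int minDisparity = std::max(0, static_cast<int>(minD16 / 16.0) - margin);
    int range        = static_cast<int>(maxD16 / 16.0) + margin - minDisparity;
    int numDisparities = ((range + 15) / 16) * 16;       // must be a multiple of 16
    numDisparities = std::max(16, numDisparities);

    return cv::StereoSGBM::create(minDisparity, numDisparities, /*blockSize*/ 5);
}
```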

Approach to sphere-sphere collision

I am trying to implement sphere-sphere collision. I understand the math behind it. However, I am still looking at tutorials to see if there are better and faster approaches. I came across NeHe's collision detection tutorial ( http://nehe.gamedev.net/tutorial/collision_detection/17005/ ). In this tutorial, if I understood correctly, he checks whether two spheres collide within a frame, and he tries not to miss the collision by first checking if their paths intersect, and then simulates it accordingly.
My approach was to check every frame whether the spheres are colliding and be done with it. I didn't consider checking intersecting paths, etc. Now I am somewhat confused about how to approach the problem.
My question is: is it really necessary to be that safe and check whether we missed a collision within a frame?
When writing collision detection algorithms, it is important to recognize that objects move in discrete time steps, unlike in the real world. In a typical modern game, objects move with a time step of 0.016 seconds per frame (often with smaller fixed or variable time steps).
It is possible that two spheres moving with very high velocities pass through each other during a single frame and are not within each other's bounding spheres after integration is performed. This problem is called tunneling, and there are multiple ways to approach it, each with varying levels of complexity and cost. Some options are swept volumes and Minkowski addition.
Choosing the right algorithm depends on your application. How precise does the collision need to be? Is it vital to your application, or can you get away with some false negatives/positives? Typically, the more precise the collision detection, the more you pay in performance.
Similar question here
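For reference, a minimal, untested sketch of the swept (continuous) test mentioned above: solve for the earliest time within the frame at which the distance between the centers equals the sum of the radii. Vec3 and the helper functions are placeholders rather than part of any particular engine:

```cpp
// A minimal sketch (untested) of continuous (swept) sphere-sphere collision:
// find the first time t in [0, dt] at which the two spheres touch.
#include <cmath>
#include <optional>

struct Vec3 { float x, y, z; };

static float dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3  sub(const Vec3& a, const Vec3& b) { return {a.x-b.x, a.y-b.y, a.z-b.z}; }

// Returns the time of impact in [0, dt], or nothing if the spheres do not
// touch during this frame.
std::optional<float> sweptSphereSphere(const Vec3& p1, const Vec3& v1, float r1,
                                       const Vec3& p2, const Vec3& v2, float r2,
                                       float dt)
{
    const Vec3 d = sub(p2, p1);          // relative position
    const Vec3 v = sub(v2, v1);          // relative velocity
    const float R = r1 + r2;

    // |d + v*t|^2 = R^2  ->  (v.v) t^2 + 2 (d.v) t + (d.d - R^2) = 0
    const float a = dot(v, v);
    const float b = 2.0f * dot(d, v);
    const float c = dot(d, d) - R * R;

    if (c <= 0.0f) return 0.0f;          // already overlapping
    if (a == 0.0f) return std::nullopt;  // no relative motion

    const float disc = b * b - 4.0f * a * c;
    if (disc < 0.0f) return std::nullopt;

    const float t = (-b - std::sqrt(disc)) / (2.0f * a);  // earliest root
    if (t >= 0.0f && t <= dt) return t;
    return std::nullopt;
}
```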

Reducing bandwidth between GPU and CPU (sending raw data or pre-calculating first)

OK, so I am just trying to work out the best way to reduce bandwidth between the GPU and CPU.
Particle Systems.
Should I be pre-calculating most things on the CPU and sending them to the GPU? This includes things like positions, rotations, velocity, calculations for alpha, random numbers, etc.
Or should I be doing as much as I can in the shaders, and using the geometry shader as much as possible?
My problem is that the sort of app I have written has to send quite a few variables to the shaders; for example, a user at run time will select emitter positions and velocity, plus a lot more. The sort of thing I am not sure how to tackle is this: if a user wants a random velocity and gives a min and max value to select the random value from, should this random value be worked out on the CPU and sent as a single value to the GPU, or should both the min and max values be sent to the GPU and a random number generator on the GPU do it? Any comments on reducing bandwidth and optimization are much appreciated.
Should I be pre-calculating most things on the CPU and sending them to the GPU? This includes things like positions, rotations, velocity, calculations for alpha, random numbers, etc.
Or should I be doing as much as I can in the shaders, and using the geometry shader as much as possible?
Impossible to answer. Spend too much CPU time and performance will drop. Spend too much GPU time and performance will drop too. Transfer too much data and performance will drop. So, instead of trying to guess (I don't know what app you're writing, what your target hardware is, etc.; you didn't even specify your target API and platform), measure/profile and select the optimal method. PROFILE instead of trying to guess the performance. There are AQTime 7 Standard, gprof, and NVPerfKit for that (plus many other tools).
Do you actually have a performance problem in your application? If you don't have any performance problems, then don't do anything. Do you have, say, ten million particles per frame in real time? If not, there's little reason to worry, since a 600 MHz CPU could handle thousands of them easily 7 years ago. On the other hand, if you have, say, a dynamic 3D environment and particles must interact with it (bounce), then doing it all on the GPU will be MUCH harder.
Anyway, to me it sounds like you don't have to optimize anything and there's no actual NEED to optimize. So the best idea would be to concentrate on some other things.
However, in any case, make sure that you're using the correct way to transfer "dynamic" data that is frequently updated. In DirectX that means using dynamic write-only vertex buffers with D3DLOCK_DISCARD|D3DLOCK_NOOVERWRITE. With OpenGL that will probably mean using STREAM or DYNAMIC buffer data with DRAW access. That should be sufficient to avoid major performance hits.
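A minimal, untested sketch of that streaming pattern on the OpenGL side, assuming a hypothetical Particle layout: orphan the buffer with GL_STREAM_DRAW each frame before refilling it, so the upload does not stall on the GPU still reading the previous frame's data:

```cpp
// A minimal sketch (untested) of streaming per-frame particle data in OpenGL.
#include <GL/glew.h>   // or your preferred GL loader
#include <vector>

struct Particle { float position[3]; float color[4]; };   // placeholder layout

void uploadParticles(GLuint vbo, const std::vector<Particle>& particles)
{
    const GLsizeiptr bytes = particles.size() * sizeof(Particle);

    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    // "Orphan" the old storage: the driver can hand back fresh memory while
    // the GPU is still reading the previous frame's data.
    glBufferData(GL_ARRAY_BUFFER, bytes, nullptr, GL_STREAM_DRAW);
    glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, particles.data());

    glBindBuffer(GL_ARRAY_BUFFER, 0);
}
```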
There's no single right answer to this. Here are some things that might help you make up your mind:
Are you sure the volume of data going over the bus is high enough to be a problem? You might want to do the math and see how much data there is per second vs. what's available on the target hardware (a rough calculation is sketched below).
Is the application likely to be CPU bound or GPU bound? If it's already GPU bound there's no point loading it up further.
Particle systems are pretty easy to implement on the CPU and will run on any hardware. A GPU implementation that supports nontrivial particle systems will be more complex and limited to hardware that supports the required functionality (e.g. stream out and an API that gives access to it.)
Consider a mixed approach. Can you split the particle systems into low complexity, high bandwidth particle systems implemented on the GPU and high complexity, low bandwidth systems implemented on the CPU?
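For example, a back-of-the-envelope check with invented numbers (not taken from the question) might look like this:

```cpp
// A rough back-of-the-envelope check (illustrative numbers only): how much
// bus traffic does re-uploading every particle each frame cost?
#include <cstdio>

int main()
{
    const double particles    = 100000.0;   // hypothetical particle count
    const double bytesPerPart = 32.0;       // e.g. position + velocity + color
    const double framesPerSec = 60.0;

    const double bytesPerSecond = particles * bytesPerPart * framesPerSec;
    std::printf("Upload rate: %.1f MB/s\n", bytesPerSecond / (1024.0 * 1024.0));
    // ~183 MB/s here - small compared to the several GB/s a PCIe x16 link
    // offers, so for this workload the bus is unlikely to be the bottleneck.
    return 0;
}
```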
All that said, I think I would start with a CPU implementation and move some of the work to the GPU if it proves necessary and feasible.

How can I best improve the execution time of a bicubic interpolation algorithm?

I'm developing some image processing software in C++ on Intel which has to run a bicubic interpolation algorithm on small (about 1kpx) images over and over again. This takes a lot of time, and I'm aiming to speed it up. What I have now is a basic implementation based on the literature, a somewhat-improved (with regard to speed) version which doesn't do matrix multiplication, but rather uses pre-calculated formulas for parts of the interpolating polynomial and last, a fixed-point version of the matrix-multiplying code (works slower actually). I also have an external library with an optimized implementation, but it's still too slow for my needs. What I was considering next is:
vectorization using MMX/SSE stream processing, on both the floating and fixed-point versions
doing the interpolation in the Fourier domain using convolution
shifting the work onto a GPU using OpenCL or similar
Which of these approaches could yield greatest performance gains? Could you suggest another? Thanks.
I think GPU is the way to go. It's probably the most natural task for this type of hardware. I would start by looking into CUDA or OpenCL. Older techniques like simple DirectX/OpenGL pixel/fragment shaders should work just fine as well.
Some links I found, maybe they could help you:
Efficient GPU-Based Texture Interpolation using Uniform B-Splines
CUDA Cubic B-Spline Interpolation (CI)
Fast Third-Order Texture Filtering
There are the Intel IPP libraries, which use SIMD internally for faster processing. Intel IPP also uses OpenMP; if configured, you can gain the benefit of relatively easy multiprocessing.
These libraries do support bicubic interpolation and are payware (you buy a development license, but redistributables are free).
Be careful with going the GPU route. If your convolution kernel is too fast, you're going to end up being IO bound. You won't know for sure which is the fastest unless you implement both.
GPU Gems 2 has a chapter on Fast Third-Order Texture Filtering which should be a good starting point for your GPU solution.
A combination of Intel Threading Building Blocks and SSE instructions would make a decent CPU solution.
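To make the SSE suggestion more concrete, here is a minimal, untested sketch of a single Catmull-Rom bicubic sample using SSE intrinsics. It assumes a single-channel float image stored row-major, with the whole 4x4 neighbourhood inside the image (border handling is left out):

```cpp
// A minimal sketch (untested): evaluate the four Catmull-Rom weights for a
// fractional coordinate once, then blend a 4x4 neighbourhood with packed
// single-precision arithmetic.
#include <xmmintrin.h>   // SSE

// Catmull-Rom weights for fractional offset t in [0,1), for samples at
// offsets -1, 0, +1, +2 (lane 0 holds the weight of the -1 sample).
static inline __m128 cubicWeights(float t)
{
    const float t2 = t * t, t3 = t2 * t;
    return _mm_set_ps( 0.5f * (t3 - t2),                      // w(+2)
                       0.5f * (-3.0f * t3 + 4.0f * t2 + t),   // w(+1)
                       0.5f * (3.0f * t3 - 5.0f * t2 + 2.0f), // w( 0)
                       0.5f * (-t3 + 2.0f * t2 - t));         // w(-1)
}

float bicubicSSE(const float* img, int stride, int x0, int y0, float fx, float fy)
{
    const __m128 wx = cubicWeights(fx);
    float wy[4];
    _mm_storeu_ps(wy, cubicWeights(fy));

    __m128 acc = _mm_setzero_ps();
    for (int row = 0; row < 4; ++row) {
        // Load 4 horizontally adjacent pixels of this row of the neighbourhood.
        const float* p = img + (y0 - 1 + row) * stride + (x0 - 1);
        const __m128 px = _mm_loadu_ps(p);
        // Weight the row horizontally, then by its vertical weight.
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_mul_ps(px, wx), _mm_set1_ps(wy[row])));
    }

    // Horizontal sum of the 4 accumulated lanes.
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```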
Not an answer for bicubic, but maybe an alternative:
If I understand you, you have 32 x 32 xy, a 1024 x 768 image, and want the interpolated image[xy].
Just rounding xy, i.e. image[ int( xy ) ], would be too grainy.
But wait: you could make a smoothed, doubled image of 2k x 1.5k once, and take image2[ int( 2*xy ) ]: less grainy, and very fast. Or similarly image4[ int( 4*xy ) ] in a smoothed 4k x 3k image.
How well this works depends on ...
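A minimal, untested sketch of that precompute-then-lookup idea, using OpenCV only for the one-time upsample and assuming a single-channel float image; the 2x factor is the one suggested above:

```cpp
// A minimal sketch (untested): upsample/smooth the image once, then answer
// each query with a cheap nearest-pixel lookup in the larger image.
#include <algorithm>
#include <opencv2/opencv.hpp>

class PrescaledLookup {
public:
    explicit PrescaledLookup(const cv::Mat& image, int factor = 2) : factor_(factor)
    {
        // One (relatively) expensive interpolation up front...
        cv::resize(image, big_, cv::Size(), factor, factor, cv::INTER_CUBIC);
    }

    // ...then every query is just a rounded, clamped index into the bigger image.
    float at(float x, float y) const
    {
        int xi = std::min(std::max(cvRound(x * factor_), 0), big_.cols - 1);
        int yi = std::min(std::max(cvRound(y * factor_), 0), big_.rows - 1);
        return big_.at<float>(yi, xi);
    }

private:
    int factor_;
    cv::Mat big_;   // precomputed, smoothed/upsampled copy
};
```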