Improve performance of dense optical flow analysis (easily)? - c++

I wrote a program that uses OpenCV's cvCalcOpticalFlowLK. It performs fine on low-resolution webcam input, but I need to run it on a full HD stream with significant other computation following the optical flow analysis for each frame. Processing a 5-minute video scaled down to 1440x810 took 4 hours :( Most of the time is being spent in cvCalcOpticalFlowLK.
I've researched improving the speed by adding more raw CPU, but even if I get an 8-core beast and the speedup is the theoretical ideal (say 8x, since I'm basically only using one of my 2.9GHz cores), I'd only be getting 4 FPS. I'd like to reach 30 FPS.
More research seems to point to implementing it on the GPU with CUDA, OpenCL, or GLSL(?). I've found some proof-of-concept implementations (e.g. http://nghiaho.com/?page_id=189), and many papers saying basically "it's a great application for the GPU, we did it, it was awesome, and no, we won't share our code". Needless to say, I haven't gotten any of them to run.
Does anyone know of a GPU-based implementation that would run on Mac with an NVIDIA card? Are there resources that might help me approach writing my own? Are there other dense OF algorithms that might perform better?
Thanks!

What about OpenVidia Bayesian Optical Flow? Also the paper Real-Time Dense and Accurate Parallel Optical Flow using CUDA says that their work is freely available in the CUDA zone. I couldn't find it there immediately, but maybe you will, or can write the authors?
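If a newer OpenCV build is an option, its cudaoptflow module exposes dense optical flow classes that run on NVIDIA GPUs through CUDA, including a dense pyramidal Lucas-Kanade variant. Below is a minimal sketch under the assumption of a CUDA-enabled OpenCV 3.x/4.x build; the window size, pyramid depth, and input path are placeholders you would tune for your footage.

// Sketch: dense pyramidal Lucas-Kanade on the GPU via OpenCV's cudaoptflow
// module. Assumes OpenCV was built with CUDA support.
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaoptflow.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/videoio.hpp>

int main() {
    cv::VideoCapture cap("input.mp4");                 // placeholder input
    cv::Ptr<cv::cuda::DensePyrLKOpticalFlow> lk =
        cv::cuda::DensePyrLKOpticalFlow::create(cv::Size(21, 21), /*maxLevel=*/3);

    cv::Mat frame, gray, prevGray;
    cv::cuda::GpuMat d_prev, d_next, d_flow;           // flow comes back as CV_32FC2 (dx, dy)

    while (cap.read(frame)) {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        if (!prevGray.empty()) {
            d_prev.upload(prevGray);
            d_next.upload(gray);
            lk->calc(d_prev, d_next, d_flow);          // dense flow computed on the GPU
            cv::Mat flow;
            d_flow.download(flow);                     // hand the field to the CPU stage
            // ... per-frame analysis on `flow` goes here ...
        }
        prevGray = gray.clone();
    }
    return 0;
}

Keeping both frames resident on the GPU and only downloading the flow field keeps the per-frame PCI-E traffic modest; whether this actually reaches 30 FPS at 1440x810 depends on the card.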

Related

Dense optical flow in real time on hardware for 4K resolution @ 30 fps

I am working on a hardware-based solution (without a GPU) for dense optical flow to get real-time performance at 30 fps with decent accuracy, something comparable to or better than NVIDIA's Optical Flow SDK. Can someone please suggest good algorithms other than pyramidal Lucas-Kanade and Horn-Schunck? I found SGM to be a good starting point, but it's difficult to implement on an FPGA or DSP core. The target is to handle large displacements and occlusion, as in real-world videos.
It would be great if someone could tell me exactly which algorithm NVIDIA has used.
For dense optical flow estimation in a real-time setup, FlowNet is a good option; it can estimate optical flow at a high frame rate. You can take their trained model and perform inference with it. Since you want to run the estimation in a non-GPU environment, you can try converting the model to the ONNX format. A good implementation of FlowNet is available in NVIDIA's GitHub repo. I am not sure exactly which algorithm NVIDIA is using in its SDK for optical flow.
FlowNet2 builds upon the original FlowNet to handle large displacements. However, if you are concerned about occlusion, you may want to check out their follow-up work on FlowNet3. Another alternative to FlowNet is PWC-Net.
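If you do go the ONNX route on a CPU, OpenCV's dnn module can run the exported graph. The sketch below only shows the mechanics; "flownet.onnx", the preprocessing, and the output layout are placeholders, since the real input format (e.g. two frames stacked along the channel axis, specific normalisation) depends on how the network was exported.

// Sketch: CPU inference on an ONNX export via OpenCV's dnn module.
// File name and preprocessing are placeholders for whatever the exported
// FlowNet graph actually expects.
#include <opencv2/dnn.hpp>
#include <opencv2/imgcodecs.hpp>

int main() {
    cv::dnn::Net net = cv::dnn::readNetFromONNX("flownet.onnx");
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_OPENCV);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CPU);

    cv::Mat frame = cv::imread("frame0.png");
    // Flow networks take a pair of frames; how the pair is packed
    // (channel-stacked, separate inputs, ...) is model-specific.
    cv::Mat blob = cv::dnn::blobFromImage(frame, 1.0 / 255.0);
    net.setInput(blob);
    cv::Mat out = net.forward();   // e.g. an N x 2 x H x W tensor of (dx, dy)
    return 0;
}

For FPGA or DSP targets this is only a reference path for checking accuracy; the network itself would still have to be ported to the vendor's toolchain.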

How to improve performance of an OpenCV algorithm written in C++

I am running a computer vision program written in C++ and OpenCV on a Raspberry Pi 3B, and my camera module is the Pi camera. Roughly speaking, I am calculating the amount of deviation from the road and sending it to another platform.
Currently, my main methodology cannot become any simpler, i.e. I can't remove any matrix operations. However, I need to increase my throughput further. For the time being, I get 19-20 results per second. My camera FPS is set to 30.
I was wondering whether there is any way to increase throughput. For example, I have tried the optimization levels on g++ (-O2, -O3) and didn't observe any improvement.
Another option is to use multithreading: since my throughput is lower than the camera FPS, maybe I can capture the next frame while another thread processes the already captured frame. However, I don't have any experience with multithreading, so I wanted to ask whether anyone can confirm this strategy; since I have limited time, I want to spend my effort in the most fruitful way.
Any other suggestions are welcome. Thank you for your help.
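The capture-in-one-thread, process-in-another idea is the usual fix when processing is only slightly slower than the camera. A minimal sketch of that pipeline (processFrame() stands in for the existing deviation computation) could look like this:

// Sketch: two-stage pipeline so frame capture and processing overlap.
#include <opencv2/core.hpp>
#include <opencv2/videoio.hpp>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

std::queue<cv::Mat> frames;
std::mutex m;
std::condition_variable frameReady;
bool done = false;

void processFrame(const cv::Mat& f) {
    // placeholder for the existing deviation-from-road computation
    (void)f;
}

void captureLoop(cv::VideoCapture& cap) {
    cv::Mat frame;
    while (cap.read(frame)) {
        std::lock_guard<std::mutex> lock(m);
        if (frames.size() < 2) frames.push(frame.clone());   // drop frames if we fall behind
        frameReady.notify_one();
    }
    { std::lock_guard<std::mutex> lock(m); done = true; }
    frameReady.notify_one();
}

void processLoop() {
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        frameReady.wait(lock, [] { return !frames.empty() || done; });
        if (frames.empty() && done) break;
        cv::Mat f = frames.front();
        frames.pop();
        lock.unlock();
        processFrame(f);   // runs while the other thread grabs the next frame
    }
}

int main() {
    cv::VideoCapture cap(0);
    std::thread capture(captureLoop, std::ref(cap));
    std::thread process(processLoop);
    capture.join();
    process.join();
    return 0;
}

This only helps if capture itself takes a meaningful share of the frame time; if the processing alone already exceeds 33 ms, reducing the per-frame work (or a second processing thread) would be needed instead.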

Rotating hundreds of JPEGs in seconds rather than hours

Our computer receives hundreds of images at a time, and we need to rotate and resize them as fast as possible.
Rotation is done by 90, 180 or 270 degrees.
Currently we are using the command-line tool GraphicsMagick to rotate the images. Rotating the images (5760x3840, ~22 MP) takes around 4 to 7 seconds.
The following Python code is sadly no faster:
import cv2

img = cv2.imread("image.jpg")
# rotate counter-clockwise: transpose, then flip around the x-axis
timg = cv2.transpose(img)
timg = cv2.flip(timg, 0)
cv2.imwrite("rotated_counter_clockwise.jpg", timg)
Is there a faster way to rotate the images using the power of the graphics card? OpenCL and OpenGL come to mind, but we are wondering whether the performance increase would be noticeable.
The hardware we are using is fairly limited as the device should be as small as possible.
Intel Atom D525 (1.8 GHz)
Mobility Radeon HD 5430 Series
4 GB of RAM
SSD Vertility 3
The software is Debian 6 with the official (closed-source) Radeon drivers.
You can perform a lossless rotation that just modifies the EXIF orientation section. This will rotate your pictures faster.
Also have a look at the jpegtran utility, which performs lossless JPEG modifications.
https://linux.die.net/man/1/jpegtran
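For example, a lossless 90° counter-clockwise rotation with jpegtran looks roughly like this (file names are placeholders):

jpegtran -rotate 270 -copy all image.jpg > rotated_counter_clockwise.jpg

Note that lossless rotation only works on whole MCU blocks; adding -perfect makes jpegtran fail instead of silently trimming edge blocks that cannot be transformed.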
There is a JPEG no-recompression plugin for IrfanView which, IIRC, can rotate and resize images (in simple ways) without recompressing. It can also run on a whole directory of images - this should be a lot faster.
The GPU probably wouldn't help; you are almost certainly I/O limited, and OpenCV is not really optimised for high-speed file access.
I'm not an expert in JPEG and compression topics, but as your problem is pretty much as I/O-limited as it gets (assuming that you can rotate without heavy decoding/encoding-related computation), you might not be able to accelerate it very much on the GPU you have. (Un)luckily, your reference point is a pretty slow Atom CPU.
I assume that the Radeon has separate main memory. This means that data needs to be communicated over PCI-E, which adds extra latency compared to CPU execution, and without hiding that latency you can be sure it will be the bottleneck. This is the most probable reason why code that uses OpenCV on the GPU would be slow here (besides the fact that you do two memory-bound operations, transpose and flip, instead of a single one).
The key thing is to hide as much of the PCI-E transfer time behind computation as possible by using multiple buffering. Overlapping transfers both to and from the GPU with computation, by making use of the full-duplex capability of PCI-E, will only work if the card in question has dual DMA engines, like high-end Radeons or the NVIDIA Quadro/Tesla cards, which I highly doubt is the case here.
If your GPU compute time (the time it takes the GPU to do the rotation) is lower than the time the transfer takes, you won't be able to fully overlap. The HD 5430 has a pretty slow memory interface with only 12.8 GB/s peak, and the rotation kernel should be quite memory-bound. Still, I can only guesstimate, but I would say that if you reach a peak PCI-E transfer rate of ~1.5 GB/s (4x PCI-E, AFAIK), the compute kernel will be a few times faster than the transfer and you'll be able to overlap very little.
You can simply time the parts separately, without writing elaborate asynchronous code, and estimate how fast you could get with optimal overlap.
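If the code were moved to the GPU through OpenCV's OpenCL-backed T-API (cv::UMat, available in newer OpenCV versions and able to target a Radeon), the three stages could be timed separately along these lines; cv::ocl::finish() drains the OpenCL queue so the wall-clock numbers mean something:

// Sketch: time upload, GPU rotate (transpose + flip) and download separately
// with OpenCV's T-API. Assumes an OpenCL-capable OpenCV build and driver.
#include <opencv2/core.hpp>
#include <opencv2/core/ocl.hpp>
#include <opencv2/imgcodecs.hpp>
#include <chrono>
#include <cstdio>

static double msSince(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
}

int main() {
    cv::ocl::setUseOpenCL(true);
    cv::Mat img = cv::imread("image.jpg");

    auto t0 = std::chrono::steady_clock::now();
    cv::UMat d_img;
    img.copyTo(d_img);                    // host -> device transfer
    cv::ocl::finish();
    std::printf("upload:   %.2f ms\n", msSince(t0));

    t0 = std::chrono::steady_clock::now();
    cv::UMat d_t, d_rot;
    cv::transpose(d_img, d_t);            // runs as OpenCL kernels on the GPU
    cv::flip(d_t, d_rot, 0);
    cv::ocl::finish();
    std::printf("rotate:   %.2f ms\n", msSince(t0));

    t0 = std::chrono::steady_clock::now();
    cv::Mat result;
    d_rot.copyTo(result);                 // device -> host transfer
    std::printf("download: %.2f ms\n", msSince(t0));
    return 0;
}

If the transfer times dominate the kernel time, that already tells you roughly how little an overlapped version could gain.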
One thing you might want to consider is getting hardware which doesn't exhibit PCI-E as a bottleneck, e.g:
AMD APU-based system. On these platforms you will be able to page-lock the memory and use it directly from the GPU;
integrated GPUs which share main memory with the host;
a fast, low-power CPU like a mobile Ivy Bridge, e.g. the i5-3427U, which consumes almost as little power as the Atom D525 but has AVX support and should be several times faster.

Real time ray tracer

I would like to make a basic real-time CPU ray tracer in C++ (mainly for learning purposes). This tutorial was great for making a basic ray tracer, but what would be the best solution for drawing it on the screen in real time? I'm not asking how to optimize the ray-tracing part, just about the painting part, so that it paints to the screen and not to a file.
I'm developing on/for windows.
You could check out this Code Project article on the basic paint mechanism using the Win32 API.
Update: the OP wants fast drawing, which the Win32 API does not provide. The OP needs this so that they can measure the speedup of the ray-tracing algorithm during the optimization process. Other possibilities for drawing are DirectX, XNA, Allegro, and OpenGL.
I work professionally on a real-time CPU ray tracer, and from what I have seen over 2 years of work there, the GPU part that displays the image won't be the bottleneck. The bottleneck, if you reach it, will be the speed of your RAM; I don't think the drawing technology will make any significant difference.
As an example, we use clustering (one CPU is not enough :p), and we were able to render 100-200 FPS at 1920x1080 when looking at the sky, but the bottleneck was not the display part, it was the network...
EDIT: We are using OpenGL for the display.
When you are doing a CPU ray tracer you are not going to call something like printPixelToGPU() for every pixel; you will write to your RAM and then send the image to the GPU once it is finished. A per-pixel printPixelToGPU() would probably cause a big overhead and is (in my opinion) a really bad design choice.
It looks like premature optimization. But if you are still concerned about it, just benchmark how many RAM-texture-to-GPU transfers per second you can do with OpenGL, DirectX, etc., and print the average frame rate. You will probably see that the frame rate is really, really high, so you will almost certainly never hit that "bottleneck" unless you are using SDRAM or a really poor GPU.
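For reference, the "write to RAM, upload once per frame" approach described above can be as simple as a textured full-screen quad. Here is a minimal sketch with GLFW and the legacy fixed-function pipeline; renderFrame() is a placeholder for the ray tracer filling an RGB buffer.

// Sketch: display a CPU-rendered framebuffer by re-uploading it each frame
// as an OpenGL texture. GLFW provides the window; the legacy pipeline keeps
// the example short.
#include <GLFW/glfw3.h>
#include <cstdint>
#include <vector>

const int W = 640, H = 480;

void renderFrame(std::vector<std::uint8_t>& rgb) {
    // placeholder: the ray tracer writes 3 bytes per pixel, row-major
    for (std::size_t i = 0; i < rgb.size(); i += 3) rgb[i] = 255;
}

int main() {
    if (!glfwInit()) return 1;
    GLFWwindow* win = glfwCreateWindow(W, H, "CPU ray tracer", nullptr, nullptr);
    glfwMakeContextCurrent(win);

    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, W, H, 0, GL_RGB, GL_UNSIGNED_BYTE, nullptr);

    std::vector<std::uint8_t> framebuffer(W * H * 3);
    while (!glfwWindowShouldClose(win)) {
        renderFrame(framebuffer);                        // all the work happens in RAM
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, W, H,    // one upload per finished frame
                        GL_RGB, GL_UNSIGNED_BYTE, framebuffer.data());
        glEnable(GL_TEXTURE_2D);
        glBegin(GL_QUADS);                               // full-screen textured quad
        glTexCoord2f(0, 0); glVertex2f(-1, -1);
        glTexCoord2f(1, 0); glVertex2f( 1, -1);
        glTexCoord2f(1, 1); glVertex2f( 1,  1);
        glTexCoord2f(0, 1); glVertex2f(-1,  1);
        glEnd();
        glfwSwapBuffers(win);
        glfwPollEvents();
    }
    glfwTerminate();
    return 0;
}

Timing the loop with and without renderFrame() gives a quick upper bound on how much of the frame budget the display path actually costs.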

Fastest deskew algorithm?

I am a little overwhelmed by the task at hand. We have a toolkit which we use for TWAIN scanning. Some of our customers are complaining about slower scan speeds when the deskew option is set. This is because if their scanner does not support hardware deskew, it is done in post-processing on the CPU. I was wondering if anyone knows of a good (i.e. fast) algorithm to achieve this. It is hard for me to say what algorithm we are using now. What algorithms are out there for this, and how do they rank as far as speed and accuracy? If I knew the names of the algorithms, it would be easier for me to do a Google search on them.
Thank You.
-Tom
Are you scanning in color or B/W?
Deskew is processor-intensive. A Group 4 TIFF or JPEG must be decompressed, the skew angle determined, the image deskewed, and then recompressed.
There are many image processing libraries out there that offer deskew, and I have evaluated many over the years. There are some huge differences in processing speed between the different libraries, and a lot of it comes down to how well the code is written rather than the algorithm used. There is a huge difference between commercial libraries in just reading and writing images.
The fastest commercial deskew I have used by far comes from Unisoft Imaging (www.unisoftimaging.com). I assume much of it is written in assembler. Unisoft has been around for many years and is very fast and efficient. It supports many different deskew options, including black border removal and color and B/W deskew. The Group 4 routines are very solid and very fast. The library comes with many other image processing options as well as TWAIN and native SCSI scanner support. It also supports Unix.
If you want a free deskew then you might want to have a look at Leptonica. It does not come with too much documentation but is very stable and well written. http://www.leptonica.com/
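For what it's worth, driving Leptonica's deskew is only a few calls. A minimal sketch follows (file names are placeholders; pixFindSkew() wants a 1 bpp image, while pixDeskew() does the detect-and-rotate in one step):

/* Sketch: skew detection and deskew with Leptonica. */
#include <leptonica/allheaders.h>
#include <stdio.h>

int main(void) {
    PIX *pixs = pixRead("scan.tif");
    if (!pixs) return 1;

    /* detection is usually done on a 1-bit version of the page */
    PIX *pix1 = pixConvertTo1(pixs, 130);
    l_float32 angle, conf;
    if (pixFindSkew(pix1, &angle, &conf) == 0)
        printf("skew angle: %.2f deg (confidence %.2f)\n", angle, conf);

    /* detect + rotate in one call; 0 = default reduction for the search */
    PIX *pixd = pixDeskew(pixs, 0);
    pixWrite("deskewed.png", pixd, IFF_PNG);

    pixDestroy(&pix1);
    pixDestroy(&pixs);
    pixDestroy(&pixd);
    return 0;
}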
Developing code from scratch could be quite time-consuming, and the result may well be buggy and error-prone.
The other option is to process the document in a separate process so that scanning can run at the speed of the scanner. At the moment you are probably processing everything serially, one task after another, hence the slowdown.
Consider doing it as post-processing, because deskew cannot be done in real time (unless it's hardware-accelerated).
Deskew consists of two steps: skew detection and rotation. Detecting the skew angle can usually be done on a B&W (1-bit) image faster. Rotation speed depends on the quality of the interpolation. A good quality deskew will take a lot of time to run, much more than scanning pages.
A good high speed scanner can do 120 double-sided pages per minute, if it has hardware JPEG or TIFF Group 4 compression, and your TWAIN library takes advantage of it (hint: do not use native mode). You barely have enough time to save the file to the hard drive at that speed, let alone decompress, skew detect, rotate, re-compress. Quality deskew takes several seconds per page, unless you can use the video card's hardware accelerator to rotate and compress.
Do I understand correctly that you already have such an algorithm implemented? If so, are you sure there is no room for optimization? I'd start by profiling the existing solution.
Anyway, I guess you should look for a fast digital Radon transform algorithm.
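In case it helps, here is a rough sketch of the projection-profile flavour of that idea in C++ with OpenCV: a coarse, discrete Radon-style search that rotates a small binarized copy over candidate angles and keeps the angle whose horizontal projection has the highest variance. The angle range, step, and scaling factor are arbitrary choices to tune.

// Sketch: projection-profile skew detection (a coarse Radon-transform search).
#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>
#include <cstdio>

static double profileVariance(const cv::Mat& bin) {
    cv::Mat rowSums;
    cv::reduce(bin, rowSums, 1, cv::REDUCE_SUM, CV_64F);   // one sum per row
    cv::Scalar mean, stddev;
    cv::meanStdDev(rowSums, mean, stddev);
    return stddev[0] * stddev[0];
}

// Returns the rotation (degrees, counter-clockwise positive) that best
// straightens the page; apply the same rotation at full resolution to deskew.
double detectSkewDegrees(const cv::Mat& gray) {
    cv::Mat smallImg, bin;
    cv::resize(gray, smallImg, cv::Size(), 0.25, 0.25, cv::INTER_AREA);  // work on a small copy
    cv::threshold(smallImg, bin, 0, 1, cv::THRESH_BINARY_INV | cv::THRESH_OTSU);

    double bestAngle = 0.0, bestScore = -1.0;
    cv::Point2f center(bin.cols / 2.0f, bin.rows / 2.0f);
    for (double a = -5.0; a <= 5.0; a += 0.25) {                         // candidate angles
        cv::Mat rotated;
        cv::Mat rot = cv::getRotationMatrix2D(center, a, 1.0);
        cv::warpAffine(bin, rotated, rot, bin.size(), cv::INTER_NEAREST);
        double score = profileVariance(rotated);
        if (score > bestScore) { bestScore = score; bestAngle = a; }
    }
    return bestAngle;
}

int main() {
    cv::Mat gray = cv::imread("page.png", cv::IMREAD_GRAYSCALE);
    if (gray.empty()) return 1;
    std::printf("skew: %.2f degrees\n", detectSkewDegrees(gray));
    return 0;
}

Most of the cost is in rotating the small binary image at each candidate angle, so the search is cheap compared with rotating the full page once at the end.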
Take a look at http://pagetools.sourceforge.net. They have a deskew algorithm implementation.