Optimize calculation of white pixels in a binary image - c++

I have a program which does the following steps (using OpenCV):
1. Connect to a camera
2. Start a loop
3. Fetch a frame
4. Extract the red channel
5. Threshold the extracted channel
6. Put it into a deque to build a buffer (right now, a three-image buffer)
7. Calculate the variation among the frames in the buffer (some morphology included)
8. Take that variation as a binary image
9. Count the amount of variation (white pixels)
10. If there is variation, calculate its center.
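For reference, a minimal sketch of that loop (OpenCV C++ assumed; the threshold value, the morphology step and the way the variation is computed are simplified placeholders, not the original code):

#include <deque>
#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap(0);                    // step 1: connect to a camera
    std::deque<cv::Mat> buffer;                 // the three-image buffer
    for (;;) {                                  // step 2: the loop
        cv::Mat frame, channels[3], bin;
        if (!cap.read(frame)) break;            // step 3: fetch a frame
        cv::split(frame, channels);             // step 4: red channel (BGR index 2)
        cv::threshold(channels[2], bin, 128, 255, cv::THRESH_BINARY); // step 5
        buffer.push_back(bin);                  // step 6: update the buffer
        if (buffer.size() < 3) continue;
        cv::Mat variation;                      // steps 7-8: variation, simplified
        cv::absdiff(buffer.front(), buffer.back(), variation);
        cv::erode(variation, variation, cv::Mat());
        buffer.pop_front();
        int variatedPixels = cv::countNonZero(variation); // step 9: the costly line
        if (variatedPixels > 0) {               // step 10: center of the variation
            cv::Moments m = cv::moments(variation, true);
            cv::Point2d center(m.m10 / m.m00, m.m01 / m.m00);
        }
    }
}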
My problem is that the loop starting at the second step should ideally repeat 90 times a second, and the CPU it runs on is quite weak (a Raspberry Pi), so I decided to benchmark the application once it bottlenecked.
I broke things up into four groups: steps 3, 4-6, 7-8 and 9. Here are some results in microseconds (the benchmarks are based on system time, not CPU time, so they are not 100% precise):
Read camera:5101; Update buffer:15032; Calculate the variation:8149; Count non-zero:51665
Read camera:5446; Update buffer:16335; Calculate the variation:8365; Count non-zero:50005
Read camera:5394; Update buffer:15423; Calculate the variation:7163; Count non-zero:43006
Read camera:7527; Update buffer:20051; Calculate the variation:7919; Count non-zero:54895
Read camera:5492; Update buffer:16657; Calculate the variation:7757; Count non-zero:1034739
So it takes 5 to 7.5 ms to read a frame, 15 to 20 ms to apply some processing and update the buffer, 7 to 8.5 ms to calculate the buffer variation, and then anywhere from 45 ms to a full second to count the amount of variation.
The last step spikes quite often, so that 1-second figure is not uncommon.
Why is the last step taking so long? It is a single line of code:
variatedPixels = countNonZero(variation);
With a best-case scenario of 72 ms (27 ms for the first steps + 45 ms for the last), I'm nowhere close to being able to process 90 frames a second, and those are timings on an overclocked RPi 2. That target is definitely way too optimistic for the Pi.
The worst I can accept is 30 FPS for the application to work, but in that case it can't drop a single frame. That means the code has to execute in less than 33 ms.
Is there any way to get that line to run in less than 6 ms? It doesn't really seem to do that much compared to the rest of the code, so something just doesn't feel right. And why does it sometimes spike? Could it be due to a context switch?
The ideas I have so far are:
- Make the program multi-threaded. (It doesn't really need to answer in real time, it just can't drop frames; there's a 400 ms window to display the results.)
- Reduce the bit depth from 8 bits to 3 bits after thresholding (though that could lead to wrong results and no performance benefit).
Since I'm new to C++ I would like to avoid complex solutions such as multi-threading.
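One simpler idea that might be worth benchmarking first: on a strictly 0/255 binary image, the white-pixel count can also be obtained by summing all pixels and dividing by 255. Whether cv::sum actually beats countNonZero on the Pi is an assumption to verify, but both are single-pass and trivial to swap:

#include <opencv2/opencv.hpp>

// Count white pixels in a binary image whose values are exactly 0 or 255.
// cv::sum accumulates into a double per channel, so there is no overflow.
static int countWhite(const cv::Mat& bin) {
    CV_Assert(bin.type() == CV_8UC1);
    return static_cast<int>(cv::sum(bin)[0] / 255);
}

Another option along the same lines would be to fold the counting into whatever pass the morphology step already makes over the image, so the data is only touched while it is hot in cache.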
EDIT:
Here is my code: https://gist.github.com/anonymous/90570c37f175fd2461b4
It's already cleaned up to go straight to the problem.
I'm probably messing something up with the pointers, but it works. Please tell me if something there is obviously wrong, since I'm new, and I hope the code isn't too awful. :)
EDIT 2:
I fixed a little bug in the measurement while cleaning up the code: step 10 was always being executed, and its time was mistakenly being included under step 9.
It also seems that having 5-6 imshow windows being updated every second takes a lot of CPU on the Pi. (I had neglected that, since on the desktop displaying the debug frames didn't even take 1% CPU.)
Right now I think I'm at 25-35 ms. I need a little more optimization to ensure it always works. So far the detection rate of my algorithm seems to be close to ~80%.

Related

How to process data at less than camera's frame per second ability?

I am not sure how to put my question properly, so here it goes.
I am running an object detection algorithm at 40 frames per second (fps) on a camera that acts as an 'eye' for a robot. I then process the information received from the algorithm and pass actions to my robot.
The issue is that each time the algorithm runs, it gives me slightly different readings. I guess that's because it processes data 40 times per second and so keeps producing new information. But I don't need new information if my robot doesn't move, since most of the objects stay in the same position as in the previous frame.
My question: how can I enhance my algorithm so it only gives me information when there is a change in object positions, for example by comparing the last frame's reading with the current frame's reading?
I think you should try to find the motion estimation of the image; I believe MPEG-4 video uses an algorithm like that.
http://www.img.lx.it.pt/~fp/cav/Additional_material/MPEG4_video.pdf
But if you don't want something so sophisticated and you just want to see whether the second image is the same as the first one, just subtract them and look at the difference. You can also use a Gaussian filter to cut the high frequencies before subtracting, and apply a threshold to decide whether to do the processing or not.
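A minimal sketch of that blur-subtract-threshold check (OpenCV C++ assumed; the kernel size and both thresholds are placeholder values to tune):

#include <opencv2/opencv.hpp>

// Returns true if `curr` differs noticeably from `prev`.
// Assumes single-channel (grayscale) frames of the same size.
bool framesDiffer(const cv::Mat& prev, const cv::Mat& curr) {
    CV_Assert(prev.type() == CV_8UC1 && curr.type() == CV_8UC1);
    cv::Mat a, b, diff;
    cv::GaussianBlur(prev, a, cv::Size(5, 5), 0);   // cut high-frequency noise
    cv::GaussianBlur(curr, b, cv::Size(5, 5), 0);
    cv::absdiff(a, b, diff);                        // per-pixel difference
    cv::threshold(diff, diff, 25, 255, cv::THRESH_BINARY);
    return cv::countNonZero(diff) > 100;            // enough changed pixels?
}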

Detect if images are different in real-time

I am working on a microscope that streams live images via a built-in video camera to a PC, where further image processing can be performed on the streamed image. Any processing done on the streamed image must be done in "real-time" (minimal frames dropped).
We take the average of a series of static images to counter random noise from the camera to improve the output of some of our image processing routines.
My question is: how do I know if the image is no longer static - either the sample under inspection has moved or rotated/camera zoom-in or out - so I can reset the image series used for averaging?
I looked through some of the threads, and some ideas that seemed interesting:
Note: using Windows, C++ and Intel IPP. With IPP the image is a byte array (Ipp8u).
1. Hash the images, and compare the hashes (normal hash or perceptual hash?)
2. Use normalized cross correlation (IPP has many variations - which to use?)
Which do you guys think is suitable for my situation (speed)?
If your camera doesn't shake, you can, as inVader said, subtract images. Then a sum of the absolute values of all pixels of the difference image is sometimes enough to tell whether the images are the same or different. However, if your noise, lighting level, etc. vary, this will not give you a good enough S/N ratio.
And in noisy conditions, normal hashes are even more useless.
The best approach would be to identify that some feature of your object has changed, like its boundary (if it's regular) or its center of mass (if it's irregular). If you have a boundary position, you only need to analyze a single line of pixels, perpendicular to that boundary, to tell that the boundary has moved.
The center-of-mass position may be subject to frequent false negatives, but adding a total mass and/or a moment of inertia may help.
If the camera shakes, you may have to align images before comparing (depending on comparison method and required accuracy, a single pixel misalignment might be huge), and that's where cross-correlation helps.
Furthermore, you don't have to analyze every image. You can skip one, and if the next differs, discard both of them. That gives you twice as much time to analyze an image.
And if you are averaging images, you might just define an optimal number of images you need and compare only the first and the last image in the sequence.
So, the simplest thing to try would be to take subsequent images, subtract them from each other and look at the difference. Then define some rules, including local and global thresholds for the difference, under which two images are considered equal. Simple subtraction of the bitmap/array data, looking for maxima and calculating the average difference across the whole thing, should be no problem to do in real time.
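A sketch of that check over raw byte arrays, as in the Ipp8u case (the two thresholds are placeholders to tune, and IPP itself has primitives for differences and sums that may well beat a plain loop):

#include <cstdint>
#include <cstdlib>
#include <cstddef>

// Compare two 8-bit images stored as contiguous byte arrays of length n.
// maxDiff is the local (per-pixel) criterion, avg the global one.
bool imagesEqual(const uint8_t* a, const uint8_t* b, std::size_t n) {
    const int maxDiffThreshold = 40;    // local threshold (tune)
    const double avgDiffThreshold = 2;  // global threshold (tune)
    long long sum = 0;
    int maxDiff = 0;
    for (std::size_t i = 0; i < n; ++i) {
        int d = std::abs(a[i] - b[i]);  // absolute per-pixel difference
        sum += d;
        if (d > maxDiff) maxDiff = d;
    }
    double avg = static_cast<double>(sum) / n;
    return maxDiff < maxDiffThreshold && avg < avgDiffThreshold;
}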
If there are varying light conditions or something moving in a predictable way (like a door opening and closing), then something more powerful, albeit slower, like Gaussian mixture models for background modeling, might be worth looking into. It is quite compute-intensive, but can be parallelized pretty easily.
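For reference, OpenCV (3.x and later) ships such a Gaussian-mixture background model; a minimal sketch, with the foreground-pixel cutoff as a placeholder:

#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap(0);
    // MOG2 learns the static background and flags moving pixels.
    cv::Ptr<cv::BackgroundSubtractorMOG2> mog2 = cv::createBackgroundSubtractorMOG2();
    cv::Mat frame, fgMask;
    while (cap.read(frame)) {
        mog2->apply(frame, fgMask);  // fgMask: 255 = foreground, 127 = shadow
        if (cv::countNonZero(fgMask) > 500) {
            // something moved; reset the averaging sequence here
        }
    }
}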
Motion detection algorithms are what you want here.
http://www.codeproject.com/Articles/10248/Motion-Detection-Algorithms
http://www.codeproject.com/Articles/22243/Real-Time-Object-Tracker-in-C
First of all I would take a series of images at a slow fps rate and downsample those images to make them smaller, not too much but enough to speed up the process.
Now you have several options:
You could compute a sum of absolute differences (SAD) of the two images by subtracting them, and use a threshold to decide whether the image has changed.
If you want to speed it up even further, I would suggest doing a progressive SAD using a small kernel, moving from the top of the image to the bottom. You can track the overall amount of difference during the process and stop early once you are satisfied.
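A sketch of that progressive, early-exit SAD over row blocks (the block height and stop threshold are placeholder assumptions):

#include <cstdint>
#include <cstdlib>
#include <cstddef>

// Progressive SAD over horizontal blocks of rows; stops as soon as the
// accumulated difference crosses stopThreshold.
bool changedEarlyExit(const uint8_t* a, const uint8_t* b,
                      int width, int height, long long stopThreshold) {
    const int blockRows = 16;  // placeholder block height (tune)
    long long sad = 0;
    for (int y0 = 0; y0 < height; y0 += blockRows) {
        int y1 = (y0 + blockRows < height) ? y0 + blockRows : height;
        for (int y = y0; y < y1; ++y)
            for (int x = 0; x < width; ++x) {
                std::size_t i = static_cast<std::size_t>(y) * width + x;
                sad += std::abs(a[i] - b[i]);
            }
        if (sad > stopThreshold)  // enough evidence of change; stop early
            return true;
    }
    return false;
}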

Lowpass FIR Filter with FFT Convolution - Overlap add, why and how

First off, sorry for not posting the code here. For some reason all the code got messed up when I tried to enter it on this page, and it probably was too much to post anyhow. Here is my code: http://pastebin.com/bmMRehbd
Now, from what I'm being told, the reason I can't get a good result out of this code is that I'm not using overlap-add. I have tried several sources on the internet explaining why I need overlap-add, but I can't understand them. It seems like the actual filter works, since anything above the given cutoff does indeed get cut off.
I should mention this code is made to work with the VST2 SDK.
Can someone tell me why I need to add it and how I can implement overlap-add in the given code?
I should also mention that I'm pretty hopeless when it comes to algorithms and maths. I'm one of those people who need to get a visual grip on what they're doing, or to have things explained through code :), and by that I mean the actual overlap.
Overlap-add theory: http://en.wikipedia.org/wiki/Overlap%E2%80%93add_method
Thanks for all the help you can give!
The overlap-add method is needed to handle the boundaries of each FFT buffer. The problem is that multiplication in the FFT domain results in circular convolution in the time domain. This means that after performing the IFFT, the results at the end of the frame wrap around and corrupt the output samples at the beginning of the frame.
It may be easier to think about it this way: Say you have a filter of length N. Linear convolution of this filter with M input samples actually returns M+N-1 output samples. However, the circular convolution done in the FFT domain results in the same number of input and output samples, M. The extra N-1 samples from linear convolution have "wrapped" around and corrupted the first N-1 output samples.
Here's an example (MATLAB or Octave):
a = [1,2,3,4,5,6];
b = [1,2,1];
conv(a,b) %linear convolution
1 4 8 12 16 20 17 6
ifft(fft(a,6).*fft(b,6)) %circular convolution
18 10 8 12 16 20
Notice that the last 2 samples have wrapped around and added to the first 2 samples in the circular case.
The overlap-add/overlap-save methods are basically methods of handling this wraparound. The overlap of FFT buffers is needed since circular convolution returns fewer uncorrupted output samples than the number of input samples.
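To make the mechanics concrete, here is a hedged sketch of overlap-add in C++. For clarity it does each block's convolution directly in the time domain; in a real implementation each block would be zero-padded to at least L+N-1 samples, FFT'd, multiplied by the filter's FFT, and inverse-transformed, with the same block-and-tail bookkeeping:

#include <algorithm>
#include <cstddef>
#include <vector>

// Overlap-add: filter x with FIR h in blocks of length L.
// Each block's convolution yields len + h.size() - 1 samples; the tail
// (h.size() - 1 samples) overlaps the next block and is simply added in.
std::vector<double> overlapAdd(const std::vector<double>& x,
                               const std::vector<double>& h,
                               std::size_t L) {
    const std::size_t N = h.size();
    std::vector<double> y(x.size() + N - 1, 0.0);
    for (std::size_t start = 0; start < x.size(); start += L) {
        std::size_t len = std::min(L, x.size() - start);
        // Linear convolution of this block with h. In the FFT version this
        // is ifft(fft(block, M) .* fft(h, M)) with M >= len + N - 1, where
        // the zero-padding is exactly what prevents circular wraparound.
        for (std::size_t i = 0; i < len; ++i)
            for (std::size_t j = 0; j < N; ++j)
                y[start + i + j] += x[start + i] * h[j];
    }
    return y;
}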
When you do a convolution (with a finite impulse response filter) by taking the inverse discrete Fourier transform of the product of the discrete Fourier transforms of two input signals, you are really implementing circular convolution. I'll hereby call this "convolution computed in the frequency domain." (If you don't know what a circular convolution is: it's basically a convolution where you assume the domain is circular, i.e., shifting the signal off one side makes it "wrap around" to the other side of the domain.)
You generally want to perform convolution by using fast Fourier transforms for large signals because it's computationally more efficient.
Overlap-add (and its cousin overlap-save) are methods that work around the fact that convolutions done in the frequency domain are really circular convolutions, whereas in reality we rarely ever want circular convolution, but typically linear convolution.
Overlap-add does it by zero-padding chunks of the input signal and then appropriately combining the portions of the circular convolutions (that were done in the frequency domain). Overlap-save does it by keeping only the portion of the signal that corresponds to linear convolution and tossing the part that was "corrupted" by the circular shifts.
Here are links from Wikipedia for both methods.
Overlap-add : This one has a nice figure explaining what's going on.
Overlap-save
This book by Orfanidis explains it well. See section 9.9.2. It's not the "de facto" standard on signal processing, but it's extremely well written and is a better introduction than other books, in my opinion.
First, understand that convolution in the time domain is equivalent to multiplication in the frequency domain. In convolution, you are at roughly O(n*m), where n is the FIR length and m is the number of samples to be filtered. In the frequency domain, using the FFT, you are at O(n log n). For large enough n, the cost of filtering is substantially less when doing it in the frequency domain. If n is relatively small, however, the benefit decreases to the point where it's simpler to filter in the time domain. The breakpoint is subjective, but figure 50 to 100 as the point where you might switch.
Yes, a convolution filter will "work", in terms of changing the frequency response. But the multiplication in the frequency domain will also contaminate time-domain data at one end with data from the other end, and vice versa. Overlap-add/save extends the FFT size and chops off the "contaminated" end, and then uses that end data to fix the beginning of the subsequent FFT window.

Optimal code/algorithm to convert frame count to timecode?

I am searching for optimal source code / an algorithm in C++
to convert a frame count into a timecode hh:mm:ss:ff at a given fps, e.g. 25 fps.
This code is pretty good - http://www.andrewduncan.ws/Timecodes/Timecodes.html (bottom of page) -
but it is expensive: it contains 4 mod and 6 div operations.
I need to show the timecode on every frame, so computing it
could take some time.
I can of course store the evaluated timecode to avoid recalculation,
but it would be very helpful to know a better algorithm.
Thanks in advance.
General rule of thumb: In image-processing systems, which include video players, you sweat blood over the operations that run once per pixel, then you sweat over the operations that run once per image "patch" (typically a line of pixels), and you don't sweat the stuff that runs once per frame.
The reason is that the per-pixel stuff will run hundreds, maybe thousands, of times as often as the per-patch stuff, and the per-patch stuff will run hundreds, maybe thousands, of times as often as the per-frame stuff.
This means that the per-pixel stuff may run millions of times as often as the per-frame stuff. One instruction per pixel may cost millions of instructions per frame, and a few hundred, or even a few thousand, instructions per frame is lost in the noise floor against the per-pixel instruction counts.
In other words, you can probably afford the mods and divs.
Having said that, it MIGHT be reasonable to run custom counters instead of doing mods and divs.
First of all, the question might be better phrased as: what is the optimal code for converting a count of seconds into hours, minutes and seconds? At that point, if frames arrive in order, you can simply use addition to increment the previous timecode.
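A sketch of that incremental counter (assuming a constant integer fps and frames arriving strictly in order, with no drop-frame handling):

// Incremental timecode: one comparison cascade per frame, no div/mod.
struct Timecode {
    int hh = 0, mm = 0, ss = 0, ff = 0;
    int fps;
    explicit Timecode(int framesPerSecond) : fps(framesPerSecond) {}
    void tick() {                    // call once per frame
        if (++ff < fps) return;
        ff = 0;
        if (++ss < 60) return;
        ss = 0;
        if (++mm < 60) return;
        mm = 0;
        if (++hh == 24) hh = 0;      // wrap after a day
    }
};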
First off, I agree with everyone else that you probably don't need to optimize this, unless you are specifically seeing problems. However, since it's entertaining to try to find ways to, I'll give you something I saw at first glance to reduce the number of divides.
seconds = frameNumber div 30
minutes = seconds div 60
hours   = minutes div 60
frames  = frameNumber mod 30
seconds = seconds mod 60
minutes = minutes mod 60
hours   = hours mod 24
It's more lines of code, but fewer divides. Basically, since seconds, minutes and hours use some of the same math, I use the results from one in the formula for the next.
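The same idea as a hedged C++ helper (30 fps hard-coded to mirror the pseudocode above; other rates just swap the constant):

#include <cstdio>

// Frame count -> "hh:mm:ss:ff" using 3 divides and 4 mods (30 fps assumed).
void frameToTimecode(long frameNumber, char out[12]) {
    long seconds = frameNumber / 30;
    long minutes = seconds / 60;
    long hours   = minutes / 60;
    std::snprintf(out, 12, "%02d:%02d:%02d:%02d",
                  static_cast<int>(hours % 24),
                  static_cast<int>(minutes % 60),
                  static_cast<int>(seconds % 60),
                  static_cast<int>(frameNumber % 30));
}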
Mod and div operations by a small constant value may be performed efficiently as multiplication by a precomputed reciprocal (compilers typically do this for you), so they are not expensive.

detecting pauses in a spoken word audio file using pymad, pcm, vad, etc

First I am going to broadly state what I'm trying to do and ask for advice. Then I will explain my current approach and ask for answers to my current problems.
Problem
I have an MP3 file of a person speaking. I'd like to split it up into segments roughly corresponding to a sentence or phrase. (I'd do it manually, but we are talking hours of data.)
If you have advice on how to do this programmatically, or pointers to some existing utilities, I'd love to hear it. (I'm aware of voice activity detection and I've looked into it a bit, but I didn't see any freely available utilities.)
Current Approach
I thought the simplest thing would be to scan the MP3 at certain intervals and identify places where the average volume was below some threshold. Then I would use some existing utility to cut up the mp3 at those locations.
I've been playing around with pymad and I believe that I've successfully extracted the PCM (pulse code modulation) data for each frame of the mp3. Now I am stuck because I can't really seem to wrap my head around how the PCM data translates to relative volume. I'm also aware of other complicating factors like multiple channels, big endian vs little, etc.
Advice on how to map a group of pcm samples to relative volume would be key.
Thanks!
PCM is a time-frame-based encoding of sound. For each time frame, you get a peak level. (If you want a physical reference for this: the peak level corresponds to the distance the microphone membrane was moved out of its resting position at that given time.)
Let's forget that PCM can use unsigned values for 8-bit samples, and focus on signed values. If the value is > 0, the membrane was on one side of its resting position; if it is < 0, it was on the other side. The bigger the displacement from rest (no matter to which side), the louder the sound.
Most voice classification methods start with one very simple step: They compare the peak level to a threshold level. If the peak level is below the threshold, the sound is considered background noise.
Looking at the parameters in Audacity's Silence Finder, the silence level should be that threshold. The next parameter, Minimum silence duration, is obviously the length of a silence period that is required to mark a break (or in your case, the end of a sentence).
If you want to code a similar tool yourself, I recommend the following approach:
Divide your sound sample into discrete sets of a specific duration. I would start with 1/10, 1/20 or 1/100 of a second.
For each of these sets, compute the maximum peak level.
Compare this maximum peak to a threshold (the silence level in Audacity). The threshold is something you have to determine yourself, based on the specifics of your sound sample (loudness, background noise etc.). If the max peak is below your threshold, this set is silence.
Now analyse the series of classified sets: Calculate the length of silence in your recording. (length = number of silent sets * length of a set). If it is above your Minimum silence duration, assume that you have the end of a sentence here.
The main point in coding this yourself instead of continuing to use Audacity is that you can improve your classification by using advanced analysis methods. One very simple metric you can apply is called zero crossing rate, it just counts how often the sign switches in your given set of peak levels (i.e. your values cross the 0 line). There are many more, all of them more complex, but it may be worth the effort. Have a look at discrete cosine transformations for example...
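A sketch of that classification loop (16-bit signed mono PCM assumed; the set length, threshold and minimum silence duration are exactly the tuning knobs described above):

#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Returns the set indices where a sentence break is detected.
// setLen: samples per set (e.g. sampleRate / 20 for 1/20 s sets).
// threshold: the silence level; minSilentSets: minimum silence duration.
std::vector<std::size_t> findBreaks(const std::vector<int16_t>& pcm,
                                    std::size_t setLen, int threshold,
                                    std::size_t minSilentSets) {
    std::vector<std::size_t> breaks;
    std::size_t silentRun = 0;
    for (std::size_t s = 0; s * setLen < pcm.size(); ++s) {
        std::size_t begin = s * setLen;
        std::size_t end = std::min(begin + setLen, pcm.size());
        int maxPeak = 0;                              // max |sample| in the set
        for (std::size_t i = begin; i < end; ++i)
            maxPeak = std::max(maxPeak, std::abs(static_cast<int>(pcm[i])));
        if (maxPeak < threshold) {
            if (++silentRun == minSilentSets)         // long enough: a break
                breaks.push_back(s);
        } else {
            silentRun = 0;                            // speech resumed
        }
    }
    return breaks;
}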
Just wanted to update this. I'm having moderate success using Audacity's Silence Finder. However, I'm still interested in this problem. Thanks.
PCM encodes a waveform as a series of amplitude samples taken at a fixed rate; each sample is a number giving the instantaneous amplitude of the signal at that moment, so the level can be read directly from the sample values.
To estimate amplitude, take the samples over a window and compute their peak (or RMS) value. Once you've done that, you should be able to pick out the spots where the amplitude is lower.
You may also try to use a Fourier transform to estimate where the signals are most distinct.