Using multithreading in C++11 to fully utilize the CPU with OpenCV - c++

I am writing a console C++ program that analyzes films for color (and other properties) using the OpenCV Library. I have reached a significant bottleneck with the cap.retrieve() function (similar to cap.read()). This operation takes the longest out of any single function I call in my program, and takes significantly longer when reading HD videos. Despite this, I am still getting less than 50% CPU utilization when I run my program.
I have since decided that the best course of action would be to try to reach full utilization by creating a new thread each time I want to read (or "retrieve") an image from the video, with a specific maximum number of threads based on the CPU specs. I have done some reading on the basics of multithreading with C++11, but am unsure where to start.
Here is the section of my code I would like to multithread:
// Run until all frames have been read
while (cap.get(CV_CAP_PROP_POS_FRAMES) != cap.get(CV_CAP_PROP_FRAME_COUNT)) {
    // look at but do not read the frame
    cap.grab();
    // only process if user wants to process it
    // eg: if resolution = 20 then will only process every 20 frames
    if (int(cap.get(CV_CAP_PROP_POS_FRAMES)) % resolution == 0) {
        // multithread everything inside this loop??
        Mat frame;
        // retrieve frame and get data to solve for avg time to retrieve
        double t1 = (double)getTickCount();
        bool bSuccess = cap.retrieve(frame);
        t1 = ((double)getTickCount() - t1) / getTickFrequency();
        readT.push_back(t1);
        // if not success, break loop
        if (!bSuccess) {
            break;
        }
        // Get avg color and data to calculate total avg color
        // get data to solve for avg time to calc mean
        HIDDEN
        // adds a single row of the avg color to the colorCloud mat
        HIDDEN
    }
}
Thanks in advance for your help. Anything from links to resources to tutorials or pseudocode would be greatly appreciated!
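One caveat worth noting before spawning a thread per retrieve: a single cv::VideoCapture is generally not safe to use from several threads at once, so a common alternative pattern is to keep grab()/retrieve() on one thread and farm the per-frame analysis out to a pool of workers sized from std::thread::hardware_concurrency(). Below is a minimal producer/consumer sketch of that pattern (not the asker's code: the input file name and the analyse() body are placeholders for the hidden processing steps):
#include <opencv2/opencv.hpp>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

int main() {
    cv::VideoCapture cap("film.mp4");                 // hypothetical input file
    unsigned hw = std::thread::hardware_concurrency();
    const unsigned nWorkers = hw > 1 ? hw - 1 : 1;    // leave one core for decoding

    std::queue<cv::Mat> work;
    std::mutex m;
    std::condition_variable cond;
    bool done = false;

    // Placeholder for the hidden per-frame analysis (avg color etc.)
    auto analyse = [](const cv::Mat& frame) {
        cv::Scalar avg = cv::mean(frame);
        (void)avg;
    };

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < nWorkers; ++w)
        workers.emplace_back([&] {
            for (;;) {
                std::unique_lock<std::mutex> lock(m);
                cond.wait(lock, [&] { return done || !work.empty(); });
                if (work.empty()) return;             // done and queue drained
                cv::Mat frame = std::move(work.front());
                work.pop();
                lock.unlock();
                analyse(frame);                       // runs in parallel
            }
        });

    // Producer: all decoding stays on this one thread.
    cv::Mat frame;
    while (cap.read(frame)) {
        std::lock_guard<std::mutex> lock(m);
        work.push(frame.clone());                     // clone: cap reuses its buffer
        cond.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cond.notify_all();
    for (auto& t : workers) t.join();
    return 0;
}
In a real program you would also bound the queue so decoded frames don't pile up faster than the workers can analyse them.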

Related

How to reduce OpenGL CPU usage and/or how to use OpenGL properly

I'm working on a Micromouse simulation application built with OpenGL, and I have a hunch that I'm not doing things properly. In particular, I'm suspicious about the way I am getting my (mostly static) graphics to refresh at a close-to-constant framerate (60 FPS). My approach is as follows:
1) Start a timer
2) Draw my shapes and text (about a thousand of them):
glBegin(GL_POLYGON);
for (Cartesian vertex : polygon.getVertices()) {
    std::pair<float, float> coordinates = getOpenGlCoordinates(vertex);
    glVertex2f(coordinates.first, coordinates.second);
}
glEnd();
and
glPushMatrix();
glScalef(scaleX, scaleY, 0);
glTranslatef(coordinates.first * 1.0/scaleX, coordinates.second * 1.0/scaleY, 0);
for (int i = 0; i < text.size(); i += 1) {
    glutStrokeCharacter(GLUT_STROKE_MONO_ROMAN, text.at(i));
}
glPopMatrix();
3) Call
glFlush();
4) Stop the timer
5) Sleep for (1/FPS - duration) seconds
6) Call
glutPostRedisplay();
The "problem" is that the above approach really hogs my CPU - the process is using something like 96-100%. I know that there isn't anything inherently wrong with using lots of CPU, but I feel like I shouldn't be using that much all of the time.
The kicker is that most of the graphics don't change from frame to frame. It's really just a single polygon moving over (and covering up) some static shapes. Is there any way to tell OpenGL to only redraw what has changed since the previous frame (with the hope it would reduce the number of glxxx calls, which I've deemed to be the source of the "problem")? Or, better yet, is my approach to getting my graphics to refresh even correct?
First and foremost, the biggest CPU hog with OpenGL is immediate mode… and you're using it (glBegin, glEnd). The problem with IM is that every single vertex requires a couple of OpenGL calls to be made; and because OpenGL uses thread-local state, this means that each and every OpenGL call must go through some indirection. So the first step would be getting rid of that.
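As a rough illustration of that first step (a sketch only, built on the asker's getOpenGlCoordinates() and polygon.getVertices() helpers), the per-vertex glBegin/glVertex2f/glEnd calls can be replaced with a client-side vertex array so each polygon is submitted with a single draw call:
// Collect the polygon's vertices once, then hand them to OpenGL in one call.
std::vector<float> coords;
coords.reserve(polygon.getVertices().size() * 2);
for (Cartesian vertex : polygon.getVertices()) {
    std::pair<float, float> c = getOpenGlCoordinates(vertex);
    coords.push_back(c.first);
    coords.push_back(c.second);
}
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(2, GL_FLOAT, 0, coords.data());            // 2 floats per vertex
glDrawArrays(GL_POLYGON, 0, (GLsizei)(coords.size() / 2)); // one call per polygon
glDisableClientState(GL_VERTEX_ARRAY);
A VBO (or a modern-GL rewrite) goes further in the same direction, but even client-side arrays remove the per-vertex call overhead described above.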
The next issue is with how you're timing your display. If low latency between user input and display is not your ultimate goal, the standard approach would be setting up the window for double buffering, enabling V-Sync with a swap interval of 1, and doing a buffer swap (glutSwapBuffers) once the frame is rendered. Exactly what will block, and where, is implementation dependent (unfortunately), but you're more or less guaranteed to hit your screen refresh frequency exactly, as long as your renderer is able to keep up (i.e. rendering a frame takes less time than a screen refresh interval).
glutPostRedisplay merely sets a flag for the main loop to call the display function if no further events are pending, so timing a frame redraw through that is not very accurate.
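Put together, a minimal GLUT skeleton of that double-buffered approach might look like the following (a sketch under the assumptions above; enabling a swap interval of 1 is platform-specific, e.g. wglSwapIntervalEXT on Windows or glXSwapIntervalEXT on X11, and is omitted here):
#include <GL/glut.h>

// Display callback: draw, then swap buffers instead of calling glFlush().
void display() {
    glClear(GL_COLOR_BUFFER_BIT);
    // ... draw shapes and text here ...
    glutSwapBuffers();       // with V-Sync enabled this paces the loop to the refresh rate
    glutPostRedisplay();     // request the next frame; no manual sleep needed
}

int main(int argc, char** argv) {
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);   // double buffering instead of single + glFlush
    glutInitWindowSize(640, 480);
    glutCreateWindow("micromouse");                // hypothetical window title
    glutDisplayFunc(display);
    // Setting the swap interval to 1 (V-Sync) would go here; it is platform-specific.
    glutMainLoop();
    return 0;
}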
Last but not least, you may simply be misled by the way Windows accounts CPU time: time spent in driver context, which includes blocking while waiting for V-Sync, is accounted to the consumed CPU time, while it's in fact interruptible sleep. However, you wrote that you already sleep in your code, which would rule that out, because the go-to approach to get a more reasonable accounting would be adding a Sleep(1) before or after the buffer swap.
I found that putting the render thread to sleep helps reduce CPU usage (in my case) from 26% to around 8%:
#include <chrono>
#include <iostream>
#include <thread>

void render_loop() {
    ...
    auto const start_time = std::chrono::steady_clock::now();
    auto const wait_time = std::chrono::milliseconds{ 17 };
    auto next_time = start_time + wait_time;
    while (true) {
        ...
        // execute once after the thread wakes up every 17 ms, which is
        // theoretically 60 frames per second
        auto then = std::chrono::high_resolution_clock::now();
        std::this_thread::sleep_until(next_time);
        ...rendering jobs
        auto elapsed_time =
            std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - then);
        std::cout << "ms: " << elapsed_time.count() << '\n';
        next_time += wait_time;
    }
}
I thought about attempting to measure the frame rate while the thread is asleep, but there isn't any reason for my use case to attempt that. The result averaged around 16 ms, so I thought it was good enough.
Inspired by this post

OpenCV - not able to grab all frames

I have a very basic question about frame capturing using OpenCV. My code looks like this:
VideoCapture cap(0);
cv::Mat mat;
int i = 0;
while (cap.read(mat) == true) {
    //some code here
    i = i + 1;
}
It works well. However, when I look at logcat logs by OpenCV, it says
FRAMES Received 225, grabbed 123.
and this grabbed (123) usually matches with the variable 'i' (number of loops) in my code.
Ideally my code should be able to read all received frames, shouldn't it? Can someone explain this behavior?
Calling cap.read(mat) takes a certain amount of time, as it has to grab and decode an image from the video feed and convert it to the cv::Mat format. This time appears to be greater than the interval between the frames the camera delivers. You can determine the frames per second of the video capture with the following:
double frames_per_second = cap.get(CV_CAP_PROP_FPS);
Try timing how long your cap.read(mat) call takes, and see whether you can relate the ratio of frames received to frames grabbed to the ratio between the capture interval (1/frames_per_second) and the time cap.read(mat) takes to execute.
Source:
http://opencv-srf.blogspot.ca/2011/09/capturing-images-videos.html
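A rough sketch of the timing experiment suggested above, using the same OpenCV 2.x constants as the rest of this thread (note that some capture backends report 0 for CV_CAP_PROP_FPS):
#include <opencv2/opencv.hpp>
#include <iostream>

int main() {
    cv::VideoCapture cap(0);                        // same webcam as in the question
    double fps = cap.get(CV_CAP_PROP_FPS);
    std::cout << "capture interval: " << 1.0 / fps << " s\n";

    cv::Mat mat;
    for (int i = 0; i < 100; ++i) {
        double t = (double)cv::getTickCount();
        bool ok = cap.read(mat);
        t = ((double)cv::getTickCount() - t) / cv::getTickFrequency();
        if (!ok) break;
        std::cout << "cap.read() took " << t << " s\n";
    }
    return 0;
}
If cap.read(mat) consistently takes longer than 1/frames_per_second, the driver drops frames in the meantime, which matches the received/grabbed gap in the log.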

Plotting real-time data on a (qwt) oscilloscope

I'm trying to create a program, using Qt (C++), which can record audio from my microphone using QAudioInput and QIODevice.
Now I want to visualize my signal.
Any help would be appreciated. Thanks
[Edit1] - copied from your comment (by Spektre)
I have only one buffer for both channels.
I use Qt; the values of the channels are interleaved in the buffer.
This is how I separate the values:
for (int i = 0, j = 0; i < countSamples; ++j)
{
    YVectorRight[j] = Samples[i++];
    YVectorLeft[j]  = Samples[i++];
}
Afterwards I plot YVectorRight and YVectorLeft. I don't see how to trigger on only one channel.
Hehe, I did this a few years back for students during class. I hope you know how oscilloscopes work, so here are just the basics:
timebase
fsmpl is the input signal sampling frequency [Hz]
Try to use as big a value as possible (44100, 48000, ???); the maximum detectable frequency is then fsmpl/2, which gives you the top of your timebase axis. The low limit is given by your buffer length.
draw
Create a function that will render your sampling buffer from a specified start address (inside the buffer) with:
Y-scale ... amplitude setting
Y-offset ... Vertical beam position
X-offset ... Time shift or horizontal position
This can be done by modifying the start address or by just X-offsetting the curve (see the sketch below).
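A hypothetical sketch of such a draw routine (plotPoint() is only a stand-in for whatever the plotting backend, qwt or QPainter, actually provides; it is not part of the author's code):
void plotPoint(int x, double y);  // hypothetical plotting backend hook, declared elsewhere

// Render view_samples samples starting at a given buffer position, applying
// Y-scale (amplitude), Y-offset (vertical beam position) and X-offset (time shift).
void drawChannel(const double* buffer, int start, int view_samples,
                 double yScale, double yOffset, int xOffset)
{
    for (int x = 0; x < view_samples; ++x)
    {
        int i = start + xOffset + x;              // X-offset = shift in samples
        double y = buffer[i] * yScale + yOffset;  // amplitude scaling + vertical position
        plotPoint(x, y);
    }
}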
Level
Create a function which will emulate the Level functionality: search the buffer from the start address and stop if the amplitude crosses the Level. You can have more modes, but these are the basics you should implement:
amplitude: ( < lvl ) -> ( > lvl )
amplitude: ( > lvl ) -> ( < lvl )
There are many other possibilities for the level, like glitch, relative edge, ...
Preview
You can put all this together, for example, like this: you have a start-address variable, so sample data into some buffer continuously, and on a timer call Level with the start address (and update it). Then call draw with the new start address and add the timebase period to the start address (in terms of your samples, of course).
multichannel
I use Line IN, so I have stereo input (A, B = left, right); therefore I can add some other stuff like:
Level source (A,B,none)
render mode (timebase, Chebyshev (Lissajous curve if closed))
Chebyshev = the x axis is A, the y axis is B; this creates the famous Chebyshev images, which are good for dependent sinusoidal signals, usually forming circles, ellipses, distorted loops, ...
misc stuff
You can add filters for channels emulating capacitance or grounding of input and much more
GUI
You need many settings. I prefer analog knobs instead of buttons/scrollbars/sliders, just like on a real oscilloscope:
(semi)analog values: Amplitude, TimeBase, Level, X-offset, Y-offset
discrete values: level mode (/, \), level source (A, B, -), each channel (direct on, ground, off, capacity on)
Here are some screenshots of my oscilloscope:
Here is a screenshot of my generator:
And finally after adding some FFT also Spectrum Analyser
PS.
I started with DirectSound but it sucks a lot because of buggy/non-functional buffer callbacks
I use WinAPI WaveIn/Out for all sound in my apps now. After a few quirks, it is the best for my needs and has the best latency (DirectSound is more than 10 times slower), but for an oscilloscope that has no merit (I need low latency mostly for emulators).
Btw. I have these three apps as linkable C++ subwindow classes (Borland)
and last used them with my ATMega168 emulator for debugging my sensor-less BLDC driver.
Here you can try my oscilloscope, generator and spectrum analyser. If you are confused by the download, read the comments below this post; btw, the password is: "oscill"
Hope it helps; if you need help with anything, just leave a comment.
[Edit1] trigger
You trigger all channels at once, but the trigger condition is usually checked on just one. Now the implementation is simple; for example, let the trigger condition be the A (left) channel rising above the level, so:
First make continuous playback with no trigger; you wrote it is like this:
for (int i = 0, j = 0; i < countSamples; ++j)
{
    YVectorRight[j] = Samples[i++];
    YVectorLeft[j]  = Samples[i++];
}
// here draw or FFT, draw buffers YVectorRight, YVectorLeft
Add trigger
To add the trigger condition you just find the first sample that meets it and start drawing from there, so you change the code to something like this:
// static or global variables
static int i0 = 0;              // actual start for drawing
static bool _copy_data = true;  // flag that new samples need to be copied
static int level = 35;          // trigger level value; the datatype should be the same as your samples...
int i, j;
for (;;)
{
    // copy new samples to buffer if needed
    if (_copy_data)
        for (_copy_data = false, i = 0, j = 0; i < countSamples; ++j)
        {
            YVectorRight[j] = Samples[i++];
            YVectorLeft[j]  = Samples[i++];
        }
    // now search for new start
    for (i = i0 + 1; i < (countSamples >> 1); i++)
        if (YVectorLeft[i-1] < level)    // lower than level before i
            if (YVectorLeft[i] >= level) // higher than level after i
            {
                i0 = i;
                break;
            }
    if (i0 >= (countSamples >> 1) - view_samples) { i0 = 0; _copy_data = true; continue; }
    break;
}
// here draw or FFT, draw buffers YVectorRight, YVectorLeft from the i0 position
view_samples is the viewed/processed size of the data (for one or more screens); it should be a few times less than (countSamples>>1).
This code can lose one screen at the border area; to avoid that you would need to implement cyclic buffers (rings), but for starters even this is OK.
Just encode all trigger conditions through some ifs or a switch statement.
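For example, the two basic edge modes from the Level section above could be encoded in one hypothetical helper like this (TriggerMode and its values are illustrative, not the author's code):
enum TriggerMode { TRIG_RISING, TRIG_FALLING, TRIG_NONE };

// Returns true if the trigger condition is met between samples i-1 and i of channel ch.
bool triggered(const int* ch, int i, int level, TriggerMode mode)
{
    switch (mode)
    {
        case TRIG_RISING:  return (ch[i-1] <  level) && (ch[i] >= level); // ( < lvl ) -> ( > lvl )
        case TRIG_FALLING: return (ch[i-1] >= level) && (ch[i] <  level); // ( > lvl ) -> ( < lvl )
        default:           return true;                                   // free-running, no trigger
    }
}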

TBB parallel_pipeline tokens seem to be occasionally out of order

I have recently started using tbb version 4.0+r233-1 on Ubuntu 12.04 to accelerate a video panorama stitcher. The bug I'm seeing is kind of strange, and was hoping someone could shed some light on the problem.
What appears to happen is that the tokens are reaching the sink node out of order (although I find it hard to believe that's actually a bug in TBB). I'm seeing jitter in the blended video frames (e.g., blended frame N + 3 is shown when blended frame N should be displayed, which causes the video to appear to stutter). I know it has something to do with the parallel filters, because if I set the number of tokens in flight to 1 instead of 4 the stuttering no longer happens.
My pipeline is architected as follows:
Read Frames Vector from files (serial) -> Warp Frames Vector (parallel) -> Blend Frames Vector (parallel) -> Write Blended Frame to file (serial)
Below are the relevant pieces of code that I believe show the problem areas:
PipelineStitcher.h
class PipelinedStitcher {
public:
    PipelinedStitcher(
        const std::string& projectFilename,
        const std::string& outputFilename,
        double scaleFactor);
    ...
    void run();

private:
    std::vector<PanoramaParameters> panoParams;
    std::vector<cv::Mat> readFramesFromVideos();
    std::vector<cv::Mat> warpFrames(const std::vector<cv::Mat>& frames);
    cv::Mat blendFrames(std::vector<cv::Mat>& warpedFrames);
};
PipelineStitcher::run()
void PipelinedStitcher::run()
{
    parallel_pipeline( 4,
        make_filter< void, std::vector<Mat> > (
            tbb::filter::serial,
            [&](flow_control& fc) -> std::vector<Mat>
            {
                vector<Mat> frames = readFramesFromVideos();
                if (frames.empty())
                {
                    fc.stop();
                }
                return frames;
            }
        ) &
        make_filter< std::vector<Mat>, std::vector<Mat> > (
            tbb::filter::parallel,
            [&](std::vector<Mat> src) {
                vector<Mat> dst = warpFrames(src);
                return dst;
            }
        ) &
        make_filter< std::vector<Mat>, Mat > (
            tbb::filter::parallel,
            [&](std::vector<Mat> src) {
                Mat dst = blendFrames(src);
                return dst;
            }
        ) &
        make_filter< Mat, void > (
            tbb::filter::serial,
            [&](Mat src) {
                if (!videoWriter.isOpened())
                {
                    videoWriter.open(outputFilename, CV_FOURCC('D','I','V','X'), 30.0, src.size(), true);
                }
                videoWriter << src;
                imshow("panoramic view", src);
                waitKey(3);
            }
        )
    );
    videoWriter.release();
}
A few questions:
Since warpFrames and blendFrames both access the member variable vector<PanoramaParameters> panoParams, should this member be a concurrent_vector type? These parameters are created once in the constructor and never updated.
What might cause the blended frames to reach the end filter out of order with TBB?
Update 06/19/13:
Thanks to @AlexeyKukanov I was able to prove that the tokens are definitely arriving in order. What appears to happen is that either the source or the sink filter has buffering issues when all CPU cores are at 100% utilization. I have a 4-core processor; once I allow 4 tokens in flight the CPU is completely saturated and the stuttering starts. However, when 1, 2, or 3 tokens are in flight there doesn't appear to be any stuttering.
Any help would be greatly appreciated!
This is more a collection of knowledge and advice than an answer, but it got too long for a comment. My previous comments are also copied here.
The TBB usage in the code seems correct. To figure out whether the root cause is in TBB or elsewhere in your code, I recommend checking whether the frames really go out of order in the last filter, e.g. by printing order IDs assigned in the first filter. Since TBB does not expose internal token IDs, you have to assign and track IDs on your own.
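A minimal, self-contained sketch of that suggestion (not the asker's pipeline; Token and the dummy work loop are illustrative): the serial in-order source tags each token with a sequence number, a parallel stage does uneven work, and the serial in-order sink checks that the IDs still arrive in increasing order.
#include <tbb/pipeline.h>
#include <iostream>

struct Token { unsigned long id; };

int main() {
    const unsigned long total = 100;
    unsigned long nextId = 0;    // assigned in the source filter
    unsigned long expected = 0;  // checked in the sink filter

    tbb::parallel_pipeline(4,
        tbb::make_filter<void, Token>(
            tbb::filter::serial_in_order,
            [&](tbb::flow_control& fc) -> Token {
                if (nextId == total) { fc.stop(); return Token{0}; }
                return Token{ nextId++ };                 // assign the order ID here
            }) &
        tbb::make_filter<Token, Token>(
            tbb::filter::parallel,
            [](Token t) -> Token {
                volatile double x = 0;                    // simulate uneven per-token work
                for (unsigned long i = 0; i < 100000 * (t.id % 7); ++i) x += (double)i;
                return t;
            }) &
        tbb::make_filter<Token, void>(
            tbb::filter::serial_in_order,
            [&](Token t) {
                if (t.id != expected)
                    std::cout << "out of order: got " << t.id
                              << ", expected " << expected << "\n";
                ++expected;
            })
    );
    return 0;
}
If this prints nothing while the real pipeline still appears to stutter, the reordering is happening outside the pipeline, e.g. in how frames are buffered for display or written out.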
Also FYI, the number of tokens does not have to be equal to the number of HW cores. Though this number effectively limits the concurrency, it's there primarily to prevent getting short of resources (e.g. memory) when lots of tokens wait for their turn in a serial filter.
Another thing to know is that it's unspecified which thread executes which filter; in fact, any thread can execute any filter. So if, for example, the sink filter draws something on the screen, you need to make sure that drawing can be done by any thread, or otherwise redirect all drawing to a single thread. As far as I know, some GUI frameworks may require all drawing to be done by a single thread, or require some initialization routines to be called in each thread prior to drawing.

Optical flow using OpenCV

I am using the pyramidal Lucas-Kanade function of OpenCV to estimate the optical flow. I call cvGoodFeaturesToTrack and then cvCalcOpticalFlowPyrLK. This is my code:
while(1)
{
    ...
    cvGoodFeaturesToTrack(frameAth, eig_image, tmp_image, cornersA, &corner_count, 0.01, 5, NULL, 3, 0.4);
    std::cout << "CORNER COUNT AFTER GOOD FEATURES2TRACK CALL = " << corner_count << std::endl;
    cvCalcOpticalFlowPyrLK(frameAth, frameBth, pyrA, pyrB, cornersA, cornersB, corner_count,
                           cvSize(win_size, win_size), 5, features_found, features_errors,
                           cvTermCriteria(CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 20, 0.3),
                           CV_LKFLOW_PYR_A_READY | CV_LKFLOW_PYR_B_READY);
    cvCopy(frameBth, frameAth, 0);
    ...
}
frameAth is the previous gray frame and frameBth is the current gray frame from a webcam. When I output the number of good features to track in each frame, the number decreases after some time and keeps decreasing. But if I terminate the program and run it again (without disturbing the field of view of the webcam), many more points are reported as good features to track. How can the function give such a difference in the number of points for the same field of view and the same scene? And the difference is large: for example, the number of good features to track after 4 minutes of execution is 20 or 50, but when the same program is terminated and executed again the number is initially 500 to 700, and then again slowly decreases. I have been using OpenCV for the past 4 months, so I am a little new to it. Please guide me or tell me where I can find a solution. Lots of thanks in advance.
You have to call cvGoodFeaturesToTrack once (at the beginning, before the loop) to detect good features to track and then track these features using cvCalcOpticalFlowPyrLK. Take a look at the default OpenCV example: OpenCV/samples/cpp/lkdemo.cpp.
You are calling cvGoodFeaturesToTrack and passing corner_count by reference. Its value decreases if fewer features are found. You have to reset corner_count to its initial value before calling cvGoodFeaturesToTrack in each iteration of the while loop, as sketched below.
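A minimal sketch of that fix, using the asker's variable names (MAX_CORNERS is a hypothetical constant for the capacity of the cornersA array):
const int MAX_CORNERS = 500;      // hypothetical initial capacity of cornersA
int corner_count = MAX_CORNERS;
while(1)
{
    // ...
    corner_count = MAX_CORNERS;   // reset before every call; the function writes back
                                  // the number of corners it actually found
    cvGoodFeaturesToTrack(frameAth, eig_image, tmp_image, cornersA,
                          &corner_count, 0.01, 5, NULL, 3, 0.4);
    // ... cvCalcOpticalFlowPyrLK(...) as before ...
}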