TBB parallel_pipeline tokens seem to be occasionally out of order - c++

I have recently started using tbb version 4.0+r233-1 on Ubuntu 12.04 to accelerate a video panorama stitcher. The bug I'm seeing is kind of strange, and I was hoping someone could shed some light on the problem.
What appears to happen is that the tokens are reaching the sink node out of order (although I find it hard to believe that's actually a bug in TBB). I'm seeing jitter in the blended video frame (e.g., blended frame N + 3 is being shown when blended frame N should be displayed, which causes the video to appear to stutter). I know it has something to do with the parallel filters because if I set the number of tokens in flight to 1 instead of 4 the stuttering no longer happens.
My pipeline is architected as follows:
Read Frames Vector from files (serial) -> Warp Frames Vector (parallel) -> Blend Frames Vector (parallel) -> Write Blended Frame to file (serial)
Below are the relevant pieces of code that I believe show the problem areas:
class PipelinedStitcher {
public:
    PipelinedStitcher(const std::string& projectFilename,
                      const std::string& outputFilename,
                      double scaleFactor);
    void run();
    std::vector<PanoramaParameters> panoParams;
    std::vector<cv::Mat> readFramesFromVideos();
    std::vector<cv::Mat> warpFrames(const std::vector<cv::Mat>& frames);
    cv::Mat blendFrames(std::vector<cv::Mat>& warpedFrames);
};

void PipelinedStitcher::run()
{
    parallel_pipeline( 4,
        // Read Frames Vector from files (serial)
        make_filter< void, std::vector<Mat> > (
            filter::serial_in_order,
            [&](flow_control & fc)-> std::vector<Mat>
            {
                vector<Mat> frames = readFramesFromVideos();
                return frames;
            }
        ) &
        // Warp Frames Vector (parallel)
        make_filter< std::vector<Mat>, std::vector<Mat> > (
            filter::parallel,
            [&](std::vector<Mat> src) {
                vector<Mat> dst = warpFrames(src);
                return dst;
            }
        ) &
        // Blend Frames Vector (parallel)
        make_filter< std::vector<Mat>, Mat > (
            filter::parallel,
            [&](std::vector<Mat> src) {
                Mat dst = blendFrames(src);
                return dst;
            }
        ) &
        // Write Blended Frame to file (serial)
        make_filter<Mat, void> (
            filter::serial_in_order,
            [&](Mat src) {
                videoWriter.open(outputFilename, CV_FOURCC('D','I','V','X'), 30.0, src.size(), true);
                videoWriter << src;
                imshow("panoramic view", src);
            }
        )
    );
}
A few questions:
Since warpFrames and blendFrames both access the member variable vector<PanoramaParameters> panoParams, should this member be a concurrent_vector type? These parameters are created once in the constructor and never updated.
What might cause the blended frame to reach the end filter out of order with TBB?
Update 06/19/13:
Thanks to @AlexeyKukanov I was able to prove that the tokens are definitely arriving in order. What appears to happen is that either the source or the sink filter has buffering issues when all CPU cores are at 100% utilization. I have a 4-core processor, and once I allow 4 tokens in flight the CPU is completely saturated and the stuttering starts. However, when 1, 2, or 3 tokens are in flight there doesn't appear to be any stuttering.
Any help would be greatly appreciated!

It's rather a collection of knowledge and advice pieces than an answer, but it gets too long for a comment. My previous comments are also copied here.
The TBB usage in the code seems correct. To figure out whether the root cause is in TBB or elsewhere in your code, I recommend checking whether the frames really go out of order in the last filter, e.g. by printing order IDs assigned in the first filter. Since TBB does not expose internal token IDs, you have to assign and track IDs on your own.
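A minimal sketch of that ID tracking, using only the standard library so it stays independent of the TBB version. The names Tagged, Tagger and OrderChecker are hypothetical and not part of TBB; the assumption is that the first (serial) filter calls tag() on each frame set and the last (serial) filter calls check() on each token it receives.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical wrapper: a payload plus the sequence number it was given
// when it entered the pipeline.
template <typename T>
struct Tagged {
    std::size_t id;
    T payload;
};

// Intended for the first (serial) filter: hands out 0, 1, 2, ...
class Tagger {
public:
    template <typename T>
    Tagged<T> tag(T payload) {
        return Tagged<T>{next_++, std::move(payload)};
    }

private:
    std::size_t next_ = 0;
};

// Intended for the last (serial) filter: records every ID that does not
// match the next expected value.
class OrderChecker {
public:
    // Returns true when the token arrived in order.
    bool check(std::size_t id) {
        const bool ok = (id == expected_);
        if (!ok) {
            misordered_.push_back(id);
        }
        // Never move the expectation backwards, so a late token does not
        // make every following token look misordered as well.
        expected_ = std::max(expected_, id + 1);
        return ok;
    }

    const std::vector<std::size_t>& misordered() const { return misordered_; }

private:
    std::size_t expected_ = 0;
    std::vector<std::size_t> misordered_;
};
```

Both classes are meant to be used from serial filters only, so they do no locking of their own. If misordered() stays empty while the video still stutters, the reordering is happening outside the pipeline.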
Also FYI, the number of tokens does not have to be equal to the number of HW cores. Though this number effectively limits the concurrency, it's there primarily to prevent running short of resources (e.g. memory) when lots of tokens wait for their turn in a serial filter.
Another thing to know is that it's unspecified which thread executes which filter; in fact, any thread can execute any filter. So if, for example, the sink filter draws something on the screen, you need to make sure that drawing can be done by any thread, or otherwise redirect all drawing to a single thread. As far as I know, some GUI frameworks may require all drawing to be done by a single thread, or require some initialization routine to be called in each thread prior to drawing.
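One possible way to redirect all drawing to a single thread is a small hand-off queue: the sink filter only pushes finished frames, and one dedicated thread pops and displays them. The sketch below uses only the C++11 standard library; HandOffQueue is a hypothetical name, and how it is wired into the display loop is left as an assumption.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical hand-off queue: any pipeline thread may push,
// a single display thread pops.
template <typename T>
class HandOffQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        ready_.notify_one();
    }

    // Signals that no more items will be pushed.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Blocks until an item is available. Returns false once the queue
    // has been closed and fully drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) {
            return false;
        }
        out = std::move(items_.front());
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};
```

In this arrangement the sink filter would call push(frame), the display thread would loop on pop(frame) and do the drawing itself, and close() would be called after the pipeline returns. The queue is unbounded, so it assumes the display thread keeps up on average.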


C++ Maya - Getting mesh vertices from frame and subframe

I'm writing a mesh deformer plugin that gets info about the mesh from past frames to perform some calculations. In the past, to get past mesh info, I did the following:
MStatus MyClass::deform(MDataBlock& dataBlock, MItGeometry& itGeo,
                        const MMatrix& localToWorldMatrix, unsigned int index)
{
    MFnPointArrayData fnPoints;
    //... other init code
    MPlug meshPlug = nodeFn.findPlug(MString("inputMesh"));
    // gets the mesh connection from the previous frame
    MPlug meshPositionPlug = meshPlug.elementByLogicalIndex(0);
    MObject objOldMesh;
    // previous frame's vertices
    MPointArray oldMeshPositionVertices = fnPoints.array();
    // ... calculations
    return MS::kSuccess;
}
If I needed more than one frame I'd run for-loops over logical indices and repeat the process. Since creating this, however, I've found that my plugin needs not just past frames but also future frames as well as subframes (between integer frames). Since my current code relies on elementByLogicalIndex() to get past frame info, and that only takes unsigned integers with the 0th index referring to the previous frame, I can't get subframe information. I haven't tried getting future frame info yet, but I don't think that's possible either.
How do I query mesh vertex positions in an array for past/future/sub-frames? Is my current method inflexible and, if so, how else could I do this?
So, the "intended" way to accomplish this is with an MDGContext, either with an MDGContextGuard, or with the versions of MPlug.asMObject that explicitly take a context (though these are deprecated).
Having said that, in the past when I've tried to use MDGContexts to query values at other times, I've found them either VERY slow, unstable, or both. So use with caution. It's possible that things will work better if, as you say, you're dealing purely with objects coming straight from an Alembic mesh. However, if that's the case, you may have better luck reading the cache path from the node and querying through the Alembic API directly yourself.

Using multithreading in c++11 to fully utilize CPU with OpenCV

I am writing a console C++ program that analyzes films for color (and other properties) using the OpenCV Library. I have reached a significant bottleneck with the cap.retrieve() function (similar to cap.read()). This operation takes the longest out of any single function I call in my program, and takes significantly longer when reading HD videos. Despite this, I am still getting less than 50% CPU utilization when I run my program.
I have since decided that the best course of action would be to try to reach full utilization by creating a new thread each time I want to read (or "retrieve") an image from the video, with a specific maximum number of threads based upon the CPU specs. I have done some reading on the basics of multithreading with C++11, but am unsure where to start.
Here is the section of my code I would like to multithread:
// Run until all frames have been read
while(cap.get(CV_CAP_PROP_POS_FRAMES) != cap.get(CV_CAP_PROP_FRAME_COUNT)) {
    // look at but do not read the frame
    cap.grab();
    // only process if user wants to process it
    // eg: if resolution = 20 then will only process every 20 frames
    if (int(cap.get(CV_CAP_PROP_POS_FRAMES)) % resolution == 0) {
        // multithread everything inside this loop??
        Mat frame;
        // retrieve frame and get data to solve for avg time to retrieve
        double t1 = (double)getTickCount();
        bool bSuccess = cap.retrieve(frame);
        t1 = ((double)getTickCount() - t1)/getTickFrequency();
        //if not success, break loop
        if (!bSuccess) {
            break;
        }
        // Get avg color and data to calculate total avg color
        // get data to solve for avg time to calc mean
        // adds a single row of the avg color to the colorCloud mat
    }
}
Thanks in advance for your help. Anything from links to resources to tutorials or pseudocode would be greatly appreciated!

Potential Memory Leak in cv::cvtColor using CV_BGR2HSV

I am seeing radically different memory usage on an iPad3 when calling cv::cvtColor with CV_BGR2GRAY vs CV_BGR2HSV repeatedly in the context of a video processing algorithm, and I'd like some insight into why this is, or guidance on how to avoid linear memory growth with each call to cvtColor.
Here's the simplified code I wrote for testing and for context in this question:
- (void) processImage:(cv::Mat &)image {
    processor.process(image, output);
    image = output;
}
output is defined as part of the class this method is in. The processor class is defined as:
class TestFrameProcessor {
    cv::Mat tmp;
    cv::Scalar green;
public:
    TestFrameProcessor() :
        green(0,255,0) {
    }
    void process(cv::Mat &input, cv::Mat &output) {
        cv::cvtColor(input, tmp, CV_BGR2HSV);
        output = green;
    }
};
This code is only meant to be illustrative of the problem I am having, not to do anything of value. When I use CV_BGR2GRAY as the conversion type I get a memory profile like this:
With a max value of around 21 MB and an average that is around 10 MB.
When I use CV_BGR2HSV I get a memory profile like this:
With a peak value of around 225 MB and a min of around 12 MB. My concern is in this linear growth of memory usage when using the BGR2HSV conversion. In this particular run there appears to be memory cleanup happening from within OpenCV, however I have had runs (did not get a screen cap of the profile) where the memory grows unbounded until all of the system memory is used up and the application dies an ungraceful death.
Questions are:
Am I misusing the cv::Mat objects or cvtColor here? Should I call release?
Has anyone seen this type of memory usage before with cvtColor?
If so, is there a way to avoid it?
Are there known memory issues with the OpenCV implementation of cvtColor?
(I'm hoping someone else knows the answers or has better GoogleFu skills than I apparently do)

Plotting real-time data on a (Qwt) oscilloscope

I'm trying to create a program, using Qt (C++), which can record audio from my microphone using QAudioInput and QIODevice.
Now, I want to visualize my signal.
Any help would be appreciated. Thanks
[Edit1] - copied from your comment (by Spektre)
I have only one buffer for both channels.
I use Qt; the channel values are interleaved in the buffer.
This is how I separate the values:
for ( int i = 0, j = 0; i < countSamples ; ++j)
{
    YVectorRight[j]=Samples[i++];
    YVectorLeft[j] =Samples[i++];
}
After that I plot YVectorRight and YVectorLeft. I don't see how to trigger on only one channel.
Hehe, I did this a few years back for students during class. I hope you know how oscilloscopes work, so here are just the basics:
fsmpl is input signal sampling frequency [Hz]
Use as high a value as possible (44100, 48000, ???). The maximum detectable frequency is then fsmpl/2, which gives you the top of your timebase axis. The low limit is given by your buffer length.
Create function that will render your sampling buffer from specified start address (inside buffer) with:
Y-scale ... amplitude setting
Y-offset ... Vertical beam position
X-offset ... Time shift or horizontal position
This can be done by modification of start address or by just X-offsetting the curve
Create a function that emulates the Level functionality: search the buffer from the start address and stop when the amplitude crosses Level. You can have more modes, but these are the basics you should implement:
amplitude: ( < lvl ) -> ( > lvl )
amplitude: ( > lvl ) -> ( < lvl )
There are many other possibilities for level like glitch,relative edge,...
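The two basic level modes above can be sketched as a single search function. This is only an illustration: the names Edge and findTrigger are hypothetical, samples are assumed to be plain int values, and the exact comparisons (strictly below before, at or above after, and the mirror image for the falling edge) are an assumption about how "cross" should be interpreted.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical trigger modes: the two basic edge directions.
enum class Edge { Rising, Falling };

// Searches buf from index `start` for the first sample where the signal
// crosses `level` in the requested direction. Returns the index of the
// sample just after the crossing, or buf.size() if there is none.
std::size_t findTrigger(const std::vector<int>& buf, std::size_t start,
                        int level, Edge edge)
{
    // A crossing needs a previous sample, so never start below index 1.
    for (std::size_t i = (start == 0 ? 1 : start); i < buf.size(); ++i) {
        const int prev = buf[i - 1];
        const int cur = buf[i];
        const bool rising = (prev < level) && (cur >= level);
        const bool falling = (prev > level) && (cur <= level);
        if ((edge == Edge::Rising && rising) ||
            (edge == Edge::Falling && falling)) {
            return i;
        }
    }
    return buf.size();
}
```

The returned index would then serve as the start address passed to the drawing function, with the previous result plus one timebase period as the next `start`.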
You can put all this together for example like this: you have start address variable so sample data to some buffer continuously and on timer call level with start address (and update it). Then call draw with new start address and add timebase period to start address (of course in term of your samples)
I use Line IN so I have stereo input (A,B = left,right) therefore I can add some other stuff like:
Level source (A,B,none)
render mode (timebase,Chebyshev (Lissajous curve if closed))
Chebyshev = x axis is A, y axis is B this creates famous Chebyshev images which are good for dependent sinusoidal signals. Usually forming circles,ellipses,distorted loops ...
miscel stuff
You can add filters for channels emulating capacitance or grounding of input and much more
You need many settings. I prefer analog knobs instead of buttons/scrollbars/sliders, just like on a real oscilloscope.
(semi)Analog values: Amplitude,TimeBase,Level,X-offset,Y-offset
discrete values: level mode(/,),level source(A,B,-),each channel (direct on,ground,off,capacity on)
Here are some screenshots of my oscilloscope:
Here is screenshot of my generator:
And finally after adding some FFT also Spectrum Analyser
I started with DirectSound but it sucks a lot because of buggy/non-functional buffer callbacks
I use WinAPI WaveIn/Out for all sound in my Apps now. After few quirks with it, is the best for my needs and has the best latency (Directsound is too slow more than 10 times) but for oscilloscope it has no merit (I need low latency mostly for emulators)
Btw. I have these three apps as linkable C++ subwindow classes (Borland)
and last used with my ATMega168 emulator for my sensor-less BLDC driver debugging
here you can try my Oscilloscope,generator and Spectrum analyser If you are confused with download read the comments below this post btw password is: "oscill"
Hope it helps if you need help with anything just comment me
[Edit1] trigger
You trigger all channels at once, but the trigger condition is usually checked on just one. The implementation is simple. For example, let the trigger condition be the A (left) channel rising above level:
First make continuous playback with no trigger. As you wrote, it looks like this:
for ( int i = 0, j = 0; i < countSamples ; ++j)
{
    YVectorRight[j]=Samples[i++];
    YVectorLeft[j] =Samples[i++];
}
// here draw or FFT,draw buffers YVectorRight,YVectorLeft
Add trigger
To add trigger condition you just find sample that meets it and start drawing from it so you change it to something like this
// static or global variables
static int i0=0; // actual start for drawing
static bool _copy_data=true; // flag that new samples need to be copied
static int level=35; // trigger level value datatype should be the same as your samples...
int i,j;
for (;;)
{
    // copy new samples to buffer if needed
    if (_copy_data)
        for (_copy_data=false,i=0,j=0;i<countSamples;++j)
        {
            YVectorRight[j]=Samples[i++];
            YVectorLeft[j] =Samples[i++];
        }
    // now search for new start
    for (i=i0+1;i<(countSamples>>1);i++)
        if (YVectorLeft[i-1]<level) // lower than level before i
            if (YVectorLeft[i]>=level) // higher than level after i
            {
                i0=i;
                break;
            }
    // not enough samples left for one view: restart with fresh data
    if (i0>=(countSamples>>1)-view_samples) { i0=0; _copy_data=true; continue; }
    break;
}
// here draw or FFT,draw buffers YVectorRight,YVectorLeft from i0 position
view_samples is the viewed/processed size of the data (for one or more screens). It should be a few times less than (countSamples>>1).
This code can lose one screen in the border area. To avoid that you need to implement cyclic buffers (rings), but for starters even this is OK.
Just encode all trigger conditions through some ifs or a switch statement.

Detect clusters of circular objects by iterative adaptive thresholding and shape analysis

I have been developing an application to count circular objects such as bacterial colonies from pictures.
What make it easy is the fact that the objects are generally well distinct from the background.
However, a few difficulties make the analysis tricky:
The background will present gradual as well as rapid intensity change.
In the edges of the container, the object will be elliptic rather than circular.
The edges of the objects are sometimes rather fuzzy.
The objects will cluster.
The object can be very small (6px of diameter)
Ultimately, the algorithms will be used (via GUI) by people that do not have deep understanding of image analysis, so the parameters must be intuitive and very few.
The problem has been addressed many times in the scientific literature and "solved", for instance, using circular Hough transform or watershed approaches, but I have never been satisfied by the results.
One simple approach that was described is to get the foreground by adaptive thresholding and split (as I described in this post) the clustered objects using distance transform.
I have successfully implemented this method, but it could not always deal with sudden changes in intensity. Also, I have been asked by peers to come up with a more "novel" approach.
I therefore investigated other thresholding/blob detection methods to extract the foreground.
I tried MSERs but found out that they were not very robust and quite slow in my case.
I eventually came up with an algorithm that, so far, gives me excellent results:
1. I split the three channels of my image and reduce their noise (blur/median blur). For each channel:
2. I apply a manual implementation of the first step of adaptive thresholding by calculating the absolute difference between the original channel and a convolved (by a large kernel blur) one. Then, for all the relevant values of threshold:
3. I apply a threshold on the result of step 2.
4. I find contours.
5. I validate or invalidate contours on the basis of their shape (size, area, convexity...).
6. Only the valid continuous regions (i.e. delimited by contours) are then redrawn in an accumulator (1 accumulator per channel).
7. After accumulating continuous regions over values of threshold, I end up with a map of "scores of regions". The regions with the highest intensity are those that fulfilled the morphology filter criteria the most often.
8. The three maps (one per channel) are then converted to grey-scale and thresholded (the threshold is controlled by the user).
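To make the accumulation idea concrete without depending on OpenCV, here is a simplified sketch for a single channel. It deliberately swaps in simpler techniques: regions are found with a 4-connected flood fill instead of cv::findContours, and the shape filter is reduced to a pixel-count range. Grid, regionIsValid and accumulateScores are hypothetical names, and `step` is assumed to be positive.

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

using Grid = std::vector<std::vector<int>>;
using Pixel = std::pair<std::size_t, std::size_t>;

// Hypothetical stand-in for the shape filter: accepts a region purely
// on its pixel count.
bool regionIsValid(std::size_t area, std::size_t minArea, std::size_t maxArea)
{
    return area >= minArea && area <= maxArea;
}

// For each threshold t = 0, step, 2*step, ... below maxT, finds the
// 4-connected regions of pixels with value > t and adds 1 to the score
// of every pixel that belongs to a valid region.
Grid accumulateScores(const Grid& diff, int maxT, int step,
                      std::size_t minArea, std::size_t maxArea)
{
    const std::size_t rows = diff.size();
    const std::size_t cols = rows ? diff[0].size() : 0;
    Grid score(rows, std::vector<int>(cols, 0));
    const long dr[] = {1, -1, 0, 0};
    const long dc[] = {0, 0, 1, -1};

    for (int t = 0; t < maxT; t += step) {
        std::vector<std::vector<bool>> seen(rows, std::vector<bool>(cols, false));
        for (std::size_t r = 0; r < rows; ++r) {
            for (std::size_t c = 0; c < cols; ++c) {
                if (seen[r][c] || diff[r][c] <= t) continue;
                // Flood-fill one foreground region starting at (r, c).
                std::vector<Pixel> region;
                std::queue<Pixel> todo;
                todo.push(Pixel(r, c));
                seen[r][c] = true;
                while (!todo.empty()) {
                    const Pixel p = todo.front();
                    todo.pop();
                    region.push_back(p);
                    for (int k = 0; k < 4; ++k) {
                        const long nr = static_cast<long>(p.first) + dr[k];
                        const long nc = static_cast<long>(p.second) + dc[k];
                        if (nr < 0 || nc < 0 ||
                            nr >= static_cast<long>(rows) ||
                            nc >= static_cast<long>(cols)) continue;
                        const std::size_t ur = static_cast<std::size_t>(nr);
                        const std::size_t uc = static_cast<std::size_t>(nc);
                        if (seen[ur][uc] || diff[ur][uc] <= t) continue;
                        seen[ur][uc] = true;
                        todo.push(Pixel(ur, uc));
                    }
                }
                // Only valid regions contribute to the score map.
                if (regionIsValid(region.size(), minArea, maxArea)) {
                    for (const Pixel& p : region) ++score[p.first][p.second];
                }
            }
        }
    }
    return score;
}
```

Pixels that sit in a valid region for many thresholds end up with a high score, which is what the final user-controlled threshold then selects.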
Just to show you the kind of image I have to work with:
This picture represents part of 3 sample images in the top and the result of my algorithm (blue = foreground) of the respective parts in the bottom.
Here is my C++ implementation of steps 3-7:
* cv::Mat dst[3] is the result of the absolute difference between original and convolved channel.
* MCF(std::vector<cv::Point>, int, int) is a filter function that returns a positive int only if the input contour is valid.
/* Allocate 3 matrices (1 per channel)*/
cv::Mat accu[3];
/* We define the maximal threshold to be tried as half of the absolute maximal value in each channel*/
int maxBGR[3];
for(unsigned int i=0; i<3;i++){
    double min, max;
    cv::minMaxLoc(dst[i], &min, &max);
    maxBGR[i] = max/2;
    /* In addition, we fill accumulators by zeros*/
    accu[i] = cv::Mat::zeros(dst[i].size(), CV_8U);
}
/* These loops are intended to be multithreaded using
#pragma omp parallel for collapse(2) schedule(dynamic)
For each channel */
for(unsigned int i=0; i<3;i++){
    /* For each value of threshold (m_step can be > 1 in order to save time)*/
    for(int j=0;j<maxBGR[i] ;j += m_step ){
        /* Temporary matrix*/
        cv::Mat tmp;
        std::vector<std::vector<cv::Point> > contours;
        /* Thresholds dst by j*/
        cv::threshold(dst[i],tmp, j, 255, cv::THRESH_BINARY);
        /* Finds continuous regions*/
        cv::findContours(tmp, contours, CV_RETR_LIST, CV_CHAIN_APPROX_TC89_L1);
        if(contours.size() > 0){
            /* Tests each contour*/
            for(unsigned int k=0;k<contours.size();k++){
                int valid = MCF(contours[k],m_minRad,m_maxRad);
                if(valid > 0){
                    /* I found that redrawing was very much faster if the given contour was copied in a smaller container.
                    * I do not really understand why though. For instance,
                    cv::drawContours(miniTmp,contours,k,cv::Scalar(1),-1,8,cv::noArray(), INT_MAX, cv::Point(-rect.x,-rect.y));
                    is slower especially if contours is very long. */
                    std::vector<std::vector<cv::Point> > tpv(1);
                    std::copy(contours.begin()+k, contours.begin()+k+1, tpv.begin());
                    /* We make a Roi here*/
                    cv::Rect rect = cv::boundingRect(tpv[0]);
                    cv::Mat miniTmp(rect.height,rect.width,CV_8U,cv::Scalar(0));
                    cv::drawContours(miniTmp,tpv,0,cv::Scalar(1),-1,8,cv::noArray(), INT_MAX, cv::Point(-rect.x,-rect.y));
                    accu[i](rect) = miniTmp + accu[i](rect);
                }
            }
        }
    }
}
/* Make the global scoreMap*/
/* Conditional noise removal*/
I have two questions:
What is the name of such a foreground extraction approach, and do you see any reason why it could be improper to use it in this case?
Since repeatedly finding and drawing contours is quite intensive, I would like to make my algorithm faster. Can you suggest any way to achieve this goal?
Thank you very much for your help,
Several years ago I wrote an application that detects cells in a microscope image. The code is written in Matlab, and I now think it is more complicated than it should be (it was my first CV project), so I will only outline the tricks that will actually be helpful for you. Btw, it was deadly slow, but it was really good at separating large groups of twin cells.
I defined a metric by which to evaluate the chance that a given point is the center of a cell:
- Luminosity decreases in a circular pattern around it
- The variance of the texture luminosity follows a given pattern
- a cell will not cover more than % of a neighboring cell
With it, I started to iteratively find the best cell, mark it as found, then look for the next one. Because such a search is expensive, I employed genetic algorithms to search faster in my feature space.
Some results are given below: