Parallelism vs Threading - Performance - c++

I have been reading on the subject, but I haven't been able to find a concrete answer to my question. I am interested in using parallelism/multithreading to improve the performance of my game, but I have heard some contradictory claims, for example that multithreading may not produce any improvement in execution speed for a game.
I have thought of two ways to do this:
1. Putting the rendering component into a thread. There are some things I would need to change, but I have a good idea of what needs to be done.
2. Using OpenMP to parallelize the rendering function. I already have code to do so, so this might be the easier option.
This being a Uni assessment, the target hardware is my Uni's computers, which are multi-core (4 cores), and therefore I am hoping to achieve some additional efficiency using either one of those techniques.
My question is therefore the following: which one should I prefer? Which normally produces the best results?
EDIT: The main function I mean to parallelize/multithread away:
void Visualization::ClipTransBlit ( int id, Vector2i spritePosition, FrameData frame, View *view )
{
    const Rectangle viewRect = view->GetRect ();
    BYTE *bufferPtr = view->GetBuffer ();
    Texture *txt = txtMan_.GetTexture ( id );
    Rectangle clippingRect = Rectangle ( 0, frame.frameSize.x, 0, frame.frameSize.y );
    clippingRect.Translate ( spritePosition );
    clippingRect.ClipTo ( viewRect );
    Vector2i negPos ( -spritePosition.x, -spritePosition.y );
    clippingRect.Translate ( negPos );
    if ( spritePosition.x < viewRect.left_ ) { spritePosition.x = viewRect.left_; }
    if ( spritePosition.y < viewRect.top_ ) { spritePosition.y = viewRect.top_; }
    if (clippingRect.GetArea() == 0) { return; }
    //clippingRect.Translate ( frameData );
    BYTE *destPtr = bufferPtr + ((abs(spritePosition.x) - abs(viewRect.left_)) + (abs(spritePosition.y) - abs(viewRect.top_)) * viewRect.Width()) * 4; // corner position of the sprite (top left corner)
    BYTE *tempSPtr = txt->GetData() + (clippingRect.left_ + clippingRect.top_ * txt->GetSize().x) * 4;
    int w = clippingRect.Width();
    int h = clippingRect.Height();
    int endOfLine = (viewRect.Width() - w) * 4;
    int endOfSourceLine = (txt->GetSize().x - w) * 4;
    for (int i = 0; i < h; i++)
    {
        for (int j = 0; j < w; j++)
        {
            if (tempSPtr[3] != 0)
            {
                memcpy(destPtr, tempSPtr, 4);
            }
            destPtr += 4;
            tempSPtr += 4;
        }
        destPtr += endOfLine;
        tempSPtr += endOfSourceLine;
    }
}

Instead of calling memcpy for each pixel, consider just assigning the value directly. The overhead of calling a function that many times could dominate the overall execution time of this loop. E.g.:
for (int i = 0; i < h; i++)
{
    for (int j = 0; j < w; j++)
    {
        if (tempSPtr[3] != 0)
        {
            *((DWORD*)destPtr) = *((DWORD*)tempSPtr);
        }
        destPtr += 4;
        tempSPtr += 4;
    }
    destPtr += endOfLine;
    tempSPtr += endOfSourceLine;
}
You could also avoid the conditional by employing one of the usual tricks for avoiding conditionals (branchless code); in such a tight loop, conditionals can be very expensive.
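One way such a trick could look is sketched below. It is only an illustration, not the code the answer had in mind: the function names are made up, and it assumes each pixel is a single 32-bit value read on a little-endian machine, so that the alpha byte (`tempSPtr[3]` in the question) ends up in the top 8 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns src when alpha is non-zero, dst otherwise,
// without an if. The mask is all ones (0xFFFFFFFF) or all zeros.
inline std::uint32_t selectPixel(std::uint32_t dst, std::uint32_t src)
{
    const std::uint32_t alpha = src >> 24;  // assumes alpha is the top byte
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(alpha != 0);
    return (src & mask) | (dst & ~mask);
}

// Copies one row of pixels, skipping fully transparent source pixels.
void blitRowBranchless(std::uint32_t* dst, const std::uint32_t* src, std::size_t pixels)
{
    assert(pixels == 0 || (dst != nullptr && src != nullptr));
    for (std::size_t j = 0; j < pixels; ++j)
    {
        dst[j] = selectPixel(dst[j], src[j]);
    }
}
```

Note that this version always writes the destination pixel, even when nothing changes, so whether it beats the branch depends on the data and the hardware; it is worth measuring both.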
Edit:
As to whether it's better to run several instances of ClipTransBlit concurrently or to parallelize ClipTransBlit internally: generally speaking, it's better to implement parallelization at as high a level as possible, to reduce the overhead you incur by setting it up (creating threads, synchronizing them, etc.).
In your case, though, it looks like you're drawing sprites. If they overlap, then without additional synchronization high-level threading might lead to nasty visual artifacts, and even a race condition when checking the alpha value. In that case the low-level parallelism might be a better choice.
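A minimal sketch of the low-level option is given here, under stated assumptions: it uses std::thread rather than the OpenMP pragma from the question, treats pixels as 32-bit values with alpha in the top byte, and all names are hypothetical. Each thread gets a disjoint band of rows of a single sprite, so no two threads write the same destination pixel and no locking is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Blit rows [rowBegin, rowEnd) of one sprite; strides are in pixels.
void blitRows(std::uint32_t* dst, int dstStride, const std::uint32_t* src,
              int srcStride, int w, int rowBegin, int rowEnd)
{
    for (int i = rowBegin; i < rowEnd; ++i)
    {
        std::uint32_t* d = dst + static_cast<std::size_t>(i) * dstStride;
        const std::uint32_t* s = src + static_cast<std::size_t>(i) * srcStride;
        for (int j = 0; j < w; ++j)
        {
            if ((s[j] >> 24) != 0) { d[j] = s[j]; }  // skip transparent pixels
        }
    }
}

// Split the h rows into contiguous bands, one band per thread.
void blitParallel(std::uint32_t* dst, int dstStride, const std::uint32_t* src,
                  int srcStride, int w, int h, int numThreads)
{
    assert(numThreads > 0);
    const int rowsPerThread = (h + numThreads - 1) / numThreads;  // ceiling division
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t)
    {
        const int begin = std::min(h, t * rowsPerThread);
        const int end = std::min(h, begin + rowsPerThread);
        if (begin == end) { break; }  // fewer rows than threads
        workers.emplace_back(blitRows, dst, dstStride, src, srcStride, w, begin, end);
    }
    for (std::thread& worker : workers) { worker.join(); }
}
```

Creating threads for every sprite is exactly the setup overhead mentioned above, so for small sprites this may well be slower than the single-threaded loop.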

Theoretically, they should produce the same effect. In practice, it might be quite different.
If you look at the assembly code of an OpenMP program, the code in the scope of a construct such as #pragma omp parallel ... is simply turned into a function that OpenMP calls. It is similar to fork.
OpenMP is oriented towards parallel computing, whereas multithreading is more general.
For example, if you want to write a GUI program, multithreading is necessary (some frameworks may hide it, but it still needs multiple threads). However, you would never want to implement that with OpenMP.


Designing a multithreaded application that scales well

The code below is a demonstration of what I'm trying to do and it has the same problem as my original code (which is not included here). I have spectrogram code and I'm trying to improve its performance by using multiple threads (my computer has 4 cores). The spectrogram code basically computes an FFT over many overlapping frames (these frames correspond to sound samples at a particular time).
As an example let's say that we have 1000 frames which overlap by 50%.
If we're using 4 threads, then each thread should handle 250 frames. Overlapping frames just means that if our frames are 1024 samples in length, the first frame has the range 0-1023, the second frame 512-1535, the third 1024-2047 etc. (an overlap of 512 samples).
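The arithmetic described above can be sketched as follows. The names FrameRange, frameStart and framesForThread are made up for illustration, the overlap is assumed to be exactly 50%, and thread indices are 0-based here (the code in the post uses 1-based thread ids).

```cpp
#include <cassert>
#include <cstddef>

// Half-open range [begin, end) of frame indices handled by one thread.
struct FrameRange
{
    std::size_t begin;
    std::size_t end;
};

// First sample of a frame when consecutive frames overlap by 50%.
inline std::size_t frameStart(std::size_t frame, std::size_t frameLen)
{
    return frame * (frameLen / 2);
}

// Frames assigned to thread t; the last thread also takes the remainder.
inline FrameRange framesForThread(std::size_t numFrames, std::size_t numThreads, std::size_t t)
{
    assert(numThreads > 0 && t < numThreads);
    const std::size_t perThread = numFrames / numThreads;
    const std::size_t begin = t * perThread;
    const std::size_t end = (t + 1 == numThreads) ? numFrames : begin + perThread;
    return FrameRange{begin, end};
}
```

With 1000 frames and 4 threads, thread 0 gets frames 0-249 and thread 3 gets frames 750-999; frame 1 starts at sample 512 and frame 2 at sample 1024, matching the ranges listed above.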
The code creating and using the threads
void __fastcall TForm1::Button1Click(TObject *Sender)
{
    numThreads = 4;
    fftLen = 1024;
    numWindows = 10000;
    int startTime = GetTickCount();
    numOverlappingWindows = numWindows*2;
    overlap = fftLen/2;
    const unsigned numElem = fftLen*numWindows+overlap;
    rx = new float[numElem];
    for(int i=0; i<numElem; i++) {
        rx[i] = rand();
    }
    useThreads = true;
    if(useThreads){
        for(int i=0;i<numThreads;i++){
            TWorkerThread *pWorkerThread = new TWorkerThread(true);
            pWorkerThread->SetWorkerMethodCallback(&CalculateWindowFFTs);//this is called in TWorkerThread::Execute
        }
        pLock = new TCriticalSection();
        for(int i=0;i<numThreads;i++){ //start the threads>Resume();
        }
    }else CalculateWindowFFTs();
    int endTime = GetTickCount();
    Label1->Caption = IntToStr(endTime-startTime);
}
void TForm1::CalculateWindowFFTs(){
    unsigned startWnd = 0, endWnd = numOverlappingWindows, threadId;
    threadId = TWorkerThread::GetCurrentThreadId();
    unsigned wndPerThread = numOverlappingWindows/numThreads;
    startWnd = (threadId-1)*wndPerThread;
    endWnd = threadId*wndPerThread;
    endWnd = numOverlappingWindows;
    float *pReal, *pImg;
    for(unsigned i=startWnd; i<endWnd; i++){
        pReal = new float[fftLen];
        pImg = new float[fftLen];
        memcpy(pReal, &rx[i*overlap], fftLen*sizeof(float));
        memset(pImg, '0', fftLen);
        FFT(pReal, pImg, fftLen); //perform an in place FFT
    }
}
void TForm1::FFT(float *rx, float *ix, int fftSize)
{
    int i, j, k, m;
    float rxt, ixt;
    m = log(fftSize)/log(2);
    int fftSizeHalf = fftSize/2;
    j = k = fftSizeHalf;
    for (i = 1; i < (fftSize-1); i++){
        if (i < j) {
            rxt = rx[j];
            ixt = ix[j];
            rx[j] = rx[i];
            ix[j] = ix[i];
            rx[i] = rxt;
            ix[i] = ixt;
        }
        k = fftSizeHalf;
        while (k <= j){
            j = j - k;
            k = k/2;
        }
        j = j + k;
    } //end for
    int le, le2, l, ip;
    float sr, si, ur, ui;
    for (k = 1; k <= m; k++) {
        le = pow(2, k);
        le2 = le/2;
        ur = 1;
        ui = 0;
        sr = cos(PI/le2);
        si = -sin(PI/le2);
        for (j = 1; j <= le2; j++) {
            l = j - 1;
            for (i = l; i < fftSize; i += le) {
                ip = i + le2;
                rxt = rx[ip] * ur - ix[ip] * ui;
                ixt = rx[ip] * ui + ix[ip] * ur;
                rx[ip] = rx[i] - rxt;
                ix[ip] = ix[i] - ixt;
                rx[i] = rx[i] + rxt;
                ix[i] = ix[i] + ixt;
            } //end for
            rxt = ur;
            ur = rxt * sr - ui * si;
            ui = rxt * si + ui * sr;
        }
    }
}
While it's easy to divide this process over multiple threads, the performance is only marginally improved compared to the single-threaded version (<10%).
Interestingly, if I increase the number of threads to, say, 100, I do get an increase in speed of about 25%, which is surprising because I'd expect thread context-switching overhead to be a factor in this case.
At first I thought that the main reason for the poor performance was a lock on writing to a vector object, so I experimented with an array of vectors (one vector per thread), thus eliminating the need for the locks, but the performance remained pretty much the same.
pVfft = new vector<float*>[numThreads];//create an array of vectors
//and then in CalculateWindowFFTs, do something like
vector<float*> &vThr = pVfft[threadId-1];
for(unsigned i=startWnd; i<endWnd; i++){
    pReal = new float[fftLen];
    pImg = new float[fftLen];
    memcpy(pReal, &rx[i*overlap], fftLen*sizeof(float));
    memset(pImg, '0', fftLen);
    FFT(pReal, pImg, fftLen); //perform an in place FFT
}
I think I'm running into caching problems here though I'm not certain how to go about changing my design in order to have a solution that scales well.
I can also provide the code for TWorkerThread if you think that's important.
Any help is much appreciated.
As suggested by 1201ProgramAlarm, I removed that while loop and got about a 15-20% speed improvement on my system. Now my main thread is not actively waiting for the threads to finish; instead, I have TWorkerThread execute code on the main thread via TThread::Synchronize after all the worker threads have finished (i.e. when numThreads has reached 0).
While this is looking better now, it's still far from being optimal.
The locks to write to vWndFFT will hurt, as will the repeated (leaking) calls to new assigned to pReal and pImg (these should be outside the for loop).
But the real performance killer is probably your loop waiting for the threads to finish: while(TWorkerThread::GetNumThreads()>0);. This will consume one available thread in a very unfriendly way.
One quick fix (not recommended) would be to add a sleep(1) (or 2, 5, or 10) so the loop is not continuous.
A better solution would be to have the main thread be one of your calculation threads, and have a way for that thread (once it is done with all its processing) to simply wait for the other threads to finish without consuming a core, using something like WaitForMultipleObjects, which is available on Windows.
One simple way to try out your threaded code is simply to run threaded, but only use one thread. Performance should be about the same as the non-threaded version, and the results should match.
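The points above (no busy-wait, the main thread doing a share of the work, buffers allocated outside the loop, no lock on the output) could be combined roughly as in the sketch below. It uses portable std::thread and join() in place of TWorkerThread and WaitForMultipleObjects, and windowSum is a made-up stand-in for the FFT, so treat it as an illustration of the structure only.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Stand-in for the per-window work (the FFT in the post); hypothetical.
static float windowSum(const float* window, std::size_t len, std::vector<float>& scratch)
{
    scratch.assign(window, window + len);  // reuses the buffer, no new per window
    return std::accumulate(scratch.begin(), scratch.end(), 0.0f);
}

// Process windows [begin, end); each result goes to its own slot, so no lock.
static void processRange(const std::vector<float>& rx, std::size_t fftLen, std::size_t hop,
                         std::size_t begin, std::size_t end, std::vector<float>& out)
{
    std::vector<float> scratch;  // one buffer per thread, allocated outside the loop
    scratch.reserve(fftLen);
    for (std::size_t i = begin; i < end; ++i)
    {
        out[i] = windowSum(&rx[i * hop], fftLen, scratch);
    }
}

std::vector<float> processAllWindows(const std::vector<float>& rx, std::size_t fftLen,
                                     std::size_t numThreads)
{
    assert(numThreads > 0 && fftLen >= 2 && rx.size() >= fftLen);
    const std::size_t hop = fftLen / 2;  // 50% overlap
    const std::size_t numWindows = (rx.size() - fftLen) / hop + 1;
    std::vector<float> out(numWindows);
    const std::size_t perThread = numWindows / numThreads;
    std::vector<std::thread> workers;
    for (std::size_t t = 1; t < numThreads; ++t)
    {
        const std::size_t begin = t * perThread;
        const std::size_t end = (t + 1 == numThreads) ? numWindows : begin + perThread;
        workers.emplace_back([&rx, &out, fftLen, hop, begin, end] {
            processRange(rx, fftLen, hop, begin, end, out);
        });
    }
    // The main thread is one of the calculation threads: it takes the first share.
    processRange(rx, fftLen, hop, 0, (numThreads == 1) ? numWindows : perThread, out);
    for (std::thread& worker : workers) { worker.join(); }  // blocks without spinning
    return out;
}
```

Calling it with numThreads set to 1 gives the single-thread check suggested above: the results should match those of any larger thread count.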

OpenMP: parallel for doesn't do anything

I'm trying to make a parallel version of the SIFT algorithm in OpenCV.
In particular in sift.cpp:
static void calcDescriptors(const std::vector<Mat>& gpyr, const std::vector<KeyPoint>& keypoints,
Mat& descriptors, int nOctaveLayers, int firstOctave )
#pragma omp parallel for
for( size_t i = 0; i < keypoints.size(); i++ )
calcSIFTDescriptor(img, ptf, angle, size*0.5f, d, n, descriptors.ptr<float>((int)i));
This already gives a speed-up from 84ms to 52ms on a quad-core machine. It doesn't scale that well, but it's already a good result for adding one line of code.
Most of the computation inside the loop is performed by calcSIFTDescriptor(), which takes 100us on average. So most of the computation time comes from the very high number of times calcSIFTDescriptor() is called (thousands of times); accumulating all these 100us calls results in several ms.
Anyway, I'm trying to optimize the performance of calcSIFTDescriptor(). In particular, the code is divided between two for loops, and the following one takes 60us on average:
for( k = 0; k < len; k++ )
{
    float rbin = RBin[k], cbin = CBin[k];
    float obin = (Ori[k] - ori)*bins_per_rad;
    float mag = Mag[k]*W[k];
    int r0 = cvFloor( rbin );
    int c0 = cvFloor( cbin );
    int o0 = cvFloor( obin );
    rbin -= r0;
    cbin -= c0;
    obin -= o0;
    if( o0 < 0 )
        o0 += n;
    if( o0 >= n )
        o0 -= n;
    // histogram update using tri-linear interpolation
    float v_r1 = mag*rbin, v_r0 = mag - v_r1;
    float v_rc11 = v_r1*cbin, v_rc10 = v_r1 - v_rc11;
    float v_rc01 = v_r0*cbin, v_rc00 = v_r0 - v_rc01;
    float v_rco111 = v_rc11*obin, v_rco110 = v_rc11 - v_rco111;
    float v_rco101 = v_rc10*obin, v_rco100 = v_rc10 - v_rco101;
    float v_rco011 = v_rc01*obin, v_rco010 = v_rc01 - v_rco011;
    float v_rco001 = v_rc00*obin, v_rco000 = v_rc00 - v_rco001;
    int idx = ((r0+1)*(d+2) + c0+1)*(n+2) + o0;
    hist[idx] += v_rco000;
    hist[idx+1] += v_rco001;
    hist[idx+(n+2)] += v_rco010;
    hist[idx+(n+3)] += v_rco011;
    hist[idx+(d+2)*(n+2)] += v_rco100;
    hist[idx+(d+2)*(n+2)+1] += v_rco101;
    hist[idx+(d+3)*(n+2)] += v_rco110;
    hist[idx+(d+3)*(n+2)+1] += v_rco111;
}
So I tried to add #pragma omp parallel for private(k) before it, and a weird thing happens: nothing happens!
Introducing this parallel for makes the computation take 53ms on average (against 52ms before). I would have expected one or more of the following results:
Taking >52ms because of the overhead of a new parallel for
Taking <52ms because of the gain obtained from the parallel for
Some sort of inconsistency in the result, since as you can see the shared vector hist is updated concurrently
None of this happens: the result is still correct, and no atomic or critical is used.
I'm an OpenMP newbie, but from what I see, it is as if this inner parallel for were ignored. Why does this happen?
NOTE: all the reported times are averages over 10,000 runs with the same input.
I tried removing the first parallel for, leaving only the one in calcSIFTDescriptor, and what I was expecting happened: inconsistency was observed due to the lack of any thread-safety mechanism. Introducing #pragma omp critical(dataupdate) before updating hist gave consistency again, but now performance is horrible: 245ms on average.
I think this is because of the overhead of the parallel for in calcSIFTDescriptor, which is not worth paying to parallelize 30us of work.
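For reference, a common alternative to a critical section around every hist update is to give each thread a private histogram and merge once at the end. The sketch below is a simplified illustration with made-up names (accumulateHistogram, one bin per sample instead of the tri-linear update above), not the OpenCV code. Built without OpenMP, the pragmas are ignored and it runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adds weights[k] to hist[bins[k]] for every k, using per-thread partial
// histograms so that the inner loop needs no synchronization.
std::vector<float> accumulateHistogram(const std::vector<int>& bins,
                                       const std::vector<float>& weights, int numBins)
{
    assert(bins.size() == weights.size() && numBins > 0);
    std::vector<float> hist(static_cast<std::size_t>(numBins), 0.0f);
    const int len = static_cast<int>(bins.size());
    #pragma omp parallel
    {
        std::vector<float> local(hist.size(), 0.0f);  // private to each thread
        #pragma omp for nowait
        for (int k = 0; k < len; ++k)
        {
            local[static_cast<std::size_t>(bins[k])] += weights[k];
        }
        // One short critical section per thread, instead of one per update.
        #pragma omp critical
        for (std::size_t b = 0; b < hist.size(); ++b)
        {
            hist[b] += local[b];
        }
    }
    return hist;
}
```

This removes the contention, but not the cost of starting a parallel region, so for a loop that only takes tens of microseconds it may still not pay off.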
BUT THE QUESTION STILL REMAINS: why didn't the first version (with two parallel for) produce any change (in either performance or consistency)?
I found the answer myself: the second (nested) parallel for has no effect, for the following reason:
OpenMP parallel regions can be nested inside each other. If nested
parallelism is disabled, then the new team created by a thread
encountering a parallel construct inside a parallel region consists
only of the encountering thread. If nested parallelism is enabled,
then the new team may consist of more than one thread.
So, since the first parallel for takes all the available threads, the team of the second one consists only of the encountering thread itself. So nothing happens.
Cheers to myself!
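The behaviour quoted above can be observed with a small experiment like the following sketch (innerTeamSize is a made-up name). It asks for two threads at both levels and reports the size of the inner team. With nested parallelism disabled, which is assumed here to be the runtime's default, the inner team has one thread; built without OpenMP, the function simply returns 1.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the number of threads in a parallel region nested inside another one.
int innerTeamSize()
{
    int inner = 1;  // value reported when built without OpenMP
#ifdef _OPENMP
    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(2)
        {
            if (omp_get_thread_num() == 0)  // one report per inner team
            {
                #pragma omp critical
                inner = omp_get_num_threads();
            }
        }
    }
#endif
    assert(inner >= 1);
    return inner;
}
```

If the result is 1, the inner region ran on the encountering thread only, which matches the unchanged timing and the absence of races reported in the question.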

CUDA C++ shared memory and if-condition

I have a question I couldn't find an answer to myself, and I was hoping some of you could offer me some insight regarding a possible solution. Within a kernel call, I would like to insert an if-condition regarding access to shared memory.
__global__ void GridFillGPU (int * gridGLOB, int n) {
    __shared__ int grid[SIZE]; // ... initialized to zero
    int tid = threadIdx.x;
    if (tid < n) {
        for ( int k = 0; k < SIZE; k++) {
            if (grid[k] == 0) {
                grid[k] = tid+1;
            }
        }
    }
    //... here write grid to global memory gridGLOB
}
The idea is that if the element grid[k] has already been written by one thread (with the index tid), it should not be written by another one. My question is: can this even be done in parallel? Since all parallel threads perform the same for-loop, how can I be sure that the if-condition is evaluated correctly? I am guessing this will lead to race conditions. I am quite new to CUDA, so I hope this question is not stupid. I know that grid needs to be in shared memory, and that one should avoid if-statements, but I can find no other way around it at the moment.
I am thankful for any help.
EDIT: here is the explicit version, which explains why the array is called grid:
__global__ void GridFillGPU (int * pos, int * gridGLOB, int n) {
    __shared__ int grid[SIZE*7]; // ... initialized to zero
    int tid = threadIdx.x;
    if (tid < n) {
        int jmin = pos[tid] - 3;
        int jmax = pos[tid] + 3;
        for ( int j = jmin; j <= jmax; j++ ) {
            for ( int k = 0; k < SIZE; k++) {
                if (grid[(j-jmin)*SIZE + k] == 0) {
                    grid[(j-jmin)*SIZE + k] = tid+1;
                }
            }
        }
    } //... here write grid to global memory gridGLOB
}
You should model your problem in a way that you don't need to worry about whether something "has been written already", also because CUDA offers no guarantee on the order in which threads are executed, so the order might not be what you expect.
There are some minor ordering guarantees that CUDA gives you within a warp, but that is not the case here.
There are sync barriers and similar primitives you can use, but I don't think they fit your case.
If you are processing a grid, you should model it so that each thread has its own region of memory to work on, and that region should not overlap with another thread's region (at least for writing; for reading you can go outside the boundaries). Also, I would not worry about shared memory yet: make the algorithm work first, then think about optimizations such as loading a tile into shared memory using the warp.
In that case, if you want to split your domain into a grid, you should set up the kernel so that you have as many threads as your grid has "cells" (or pixels, if it is an image). Then you use the thread and block coordinates that CUDA provides to compute where you should read and write in memory.
There is a really good online course about CUDA; you might want to have a look at that.
There is also another one, but I don't know if it is open right now.
Anyway, dividing the domain into a grid is a really common and solved problem; you can find a lot of material on that.
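To make the "one thread per cell" idea concrete, here is a host-side C++ sketch rather than a real kernel: the two loops stand in for the blocks and threads that the GPU would run concurrently, and globalIndex mirrors the usual blockIdx.x * blockDim.x + threadIdx.x computation. The function names and the value written (idx + 1, echoing tid+1 in the question) are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index of the single cell owned by a given (block, thread) pair.
inline int globalIndex(int blockIdx, int blockDim, int threadIdx)
{
    return blockIdx * blockDim + threadIdx;
}

// Emulates launching enough blocks of blockDim threads to cover all cells.
std::vector<int> fillGrid(int cells, int blockDim)
{
    assert(cells >= 0 && blockDim > 0);
    std::vector<int> grid(static_cast<std::size_t>(cells), 0);
    const int numBlocks = (cells + blockDim - 1) / blockDim;  // round up
    for (int b = 0; b < numBlocks; ++b)          // on the GPU: blocks
    {
        for (int t = 0; t < blockDim; ++t)       // on the GPU: threads in a block
        {
            const int idx = globalIndex(b, blockDim, t);
            if (idx < cells)                     // guard for the partial last block
            {
                grid[static_cast<std::size_t>(idx)] = idx + 1;  // only the owner writes
            }
        }
    }
    return grid;
}
```

Because every cell has exactly one writer, the result does not depend on the order in which the iterations (threads) run, which is the property the answer is asking for.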

How to use cv::parallel_for_ for execution time reduction

I created an image processing algorithm using OpenCV and currently I'm trying to improve the time efficiency of my own, simple function which is similar to LUT, but with interpolation between values (double calibRI::corr(double)).
I optimized the pixel loop according to the OpenCV docs.
Non parallel function (calib(cv::Mat) -an object of calibRI functor class) takes about 0.15s. I decided to use cv::parallel_for_ to make it shorter.
First I implemented it as image tiling, according to this document. The time was reduced to 0.12s (4 threads).
virtual void operator()(const cv::Range& range) const
{
    for(int i = range.start; i < range.end; i++)
    {
        // divide image in 'thr' number of parts and process simultaneously
        cv::Rect roi(0, (img.rows/thr)*i, img.cols, img.rows/thr);
        cv::Mat in = img(roi);
        cv::Mat out = retVal(roi);
        out = calib(in); //loops over all pixels and does out[u,v]=calibRI::corr(in[u,v])
    }
}
I thought that running my function in parallel on subimages/tiles/ROIs was not yet optimal, so I implemented it as below:
template <typename T>
class ParallelPixelLoop : public cv::ParallelLoopBody
{
    typedef boost::function<T(T)> pixelProcessingFuntionPtr;
    cv::Mat& image; //source and result image (to be overwritten)
    bool cont; //if the image is continuous
    size_t rows;
    size_t cols;
    size_t threads;
    std::vector<cv::Range> ranges;
    pixelProcessingFuntionPtr pixelProcessingFunction; //pixel modif. function
public:
    ParallelPixelLoop(cv::Mat& img, pixelProcessingFuntionPtr fun, size_t thr = 4)
        : image(img), cont(image.isContinuous()), rows(img.rows), cols(img.cols), pixelProcessingFunction(fun), threads(thr)
    {
        int groupSize = 1;
        if (cont) {
            cols *= rows;
            rows = 1;
            groupSize = ceil( cols / threads );
        }
        else {
            groupSize = ceil( rows / threads );
        }
        int t = 0;
        for(t=0; t<threads-1; ++t) {
            ranges.push_back( cv::Range( t*groupSize, (t+1)*groupSize ) );
        }
        ranges.push_back( cv::Range( t*groupSize, rows<=1?cols:rows ) ); //last range must be to the end of image (ceil used before)
    }
    virtual void operator()(const cv::Range& range) const
    {
        for(int r = range.start; r < range.end; r++)
        {
            T* Ip = nullptr;
            cv::Range ran = ranges[r];
            if(cont) {
                Ip = image.ptr<T>(0);
                for (int j = ran.start; j < ran.end; ++j)
                {
                    Ip[j] = pixelProcessingFunction(Ip[j]);
                }
            }
            else {
                for(int i = ran.start; i < ran.end; ++i)
                {
                    Ip = image.ptr<T>(i);
                    for (int j = 0; j < cols; ++j)
                    {
                        Ip[j] = pixelProcessingFunction(Ip[j]);
                    }
                }
            }
        }
    }
};
Then I ran it on a 1280x1024 64FC1 image, on an i5 processor under Win8, and got a time in the range of 0.4s using the code below:
double t = cv::getTickCount();
ParallelPixelLoop<double> loop(V,boost::bind(&calibRI::corr,this,_1),4);
std::cout << "Exec time: " << (cv::getTickCount()-t)/cv::getTickFrequency() << "s\n";
I have no idea why my implementation is so much slower than iterating over all the pixels in subimages... Is there a bug in my code, or are OpenCV ROIs optimized in some special way?
I do not think there is a time measurement error issue, as described here. I'm using OpenCV time functions.
Is there any other way to reduce the time of this function?
Thanks in advance!
Generally, it's really hard to say why using cv::parallel_for_ failed to speed up the whole process. One possibility is that the problem is not related to processing/multithreading, but to time measurement. About 2 months ago I tried to optimize this algorithm and I noticed a strange thing: the first time I use it, it takes x ms, but if I use it a second, third, ... time (of course without restarting the application) it takes about x/2 (or even x/3) ms. I'm not sure what causes this behaviour. Most likely (in my opinion) it's caused by branch prediction: when the code is executed for the first time, the branch predictor "learns" which paths are usually taken, so next time it can predict which branch to take (and usually the guess will be correct). You can read more about it here; it's a really good question and it can open your eyes to some quite important things.
So, in your situation I would try a few things:
Measure it many times; 100 or 1000 runs should be enough (if it takes 0.12-0.4s it won't take much time), and see whether the last version of your code is still the slowest one. So just replace your code with this:
double t = cv::getTickCount();
for (unsigned int i=0; i<1000; i++) {
    ParallelPixelLoop<double> loop(V,boost::bind(&calibRI::corr,this,_1),4);
}
std::cout << "Exec time: " << (cv::getTickCount()-t)/cv::getTickFrequency() << "s\n";
Test it on a bigger image. Maybe in your situation you just "don't need" 4 cores, but on a bigger image 4 cores will make a positive difference.
Use a profiler (for example Very Sleepy) to see which part of your code is critical.
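The repeated-measurement advice can be packaged as a small helper, sketched here with std::chrono instead of the OpenCV tick functions. The name medianSeconds and the choice of the median as the summary are assumptions made for illustration; a few untimed warm-up runs absorb the first-call effects described above.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Runs work() warmupRuns times untimed, then timedRuns times timed,
// and returns the median duration of a single timed run in seconds.
template <typename F>
double medianSeconds(F&& work, int warmupRuns, int timedRuns)
{
    assert(warmupRuns >= 0 && timedRuns > 0);
    for (int i = 0; i < warmupRuns; ++i) { work(); }
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(timedRuns));
    for (int i = 0; i < timedRuns; ++i)
    {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(stop - start).count());
    }
    // Partial sort that puts the median element in the middle position.
    const std::size_t mid = samples.size() / 2;
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
    return samples[mid];
}
```

Each variant (single-threaded, tiled, per-pixel) would be passed in as a lambda, and the medians compared instead of single runs.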

no performance improvement with std::thread

I am working on an audio "real time" application and I would like to improve its performance. I actually already posted a topic, but this one is about std::thread specifically.
The audio processing is mostly done by two separate objects (leftProcessor and rightProcessor). Since these objects don't rely on each other, using two threads to process them should really improve the performance on multi-core CPUs. However, I currently get the opposite result.
Before I activated compiler optimization (O2), using two threads got me about 50% more performance, but after I switched optimization on, I got ~10-20% less performance with threads, although performance got drastically better for both versions.
I measure the performance by taking the time at two points within the function and brute-force printing the result to the screen, so there could be some problems with this. ;)
My guess would be that creating a std::thread takes more time than I actually gain from running the processing on the second thread.
In this case, is it possible to improve the performance by using the same "thread" for every function call and just passing the thread the new arguments? I don't really know if this is possible.
The function currently takes about 0.0005ms to 0.002ms to process.
Here is the code:
void AudioController::processAudio(int frameCount, float *output) {
    std::thread rightProcessorThread;
    if(rightLoaded) {
        rightProcessorThread = std::thread(&AudioProcessor::tick, //function
                                           rightProcessor, //object
                                           rightFrameBuffer, //arg1
                                           frameCount); //arg2
    } else {
        for(int i = 0; i < frameCount; i++) {
            rightFrameBuffer[i].leftSample = 0.0f;
            rightFrameBuffer[i].rightSample = 0.0f;
        }
    }
    if(rightLoaded) {
        rightProcessor->tick(rightFrameBuffer, frameCount);
    } else {
        for(int i = 0; i < frameCount; i++) {
            rightFrameBuffer[i].leftSample = 0.0f;
            rightFrameBuffer[i].rightSample = 0.0f;
        }
    }
    Frame * leftFrameBuffer = (Frame*) output;
    if(leftLoaded) {
        leftProcessor->tick(leftFrameBuffer, frameCount);
    } else {
        for(int i = 0; i < frameCount; i++) {
            leftFrameBuffer[i].leftSample = 0.0f;
            leftFrameBuffer[i].rightSample = 0.0f;
        }
    }
    if(rightLoaded) {
        // MIX
        for(int i = 0; i < frameCount; i++ ) {
            leftFrameBuffer[i] = volume * (leftRightMix * leftFrameBuffer[i] + (1.0 - leftRightMix) * rightFrameBuffer[i]);
        }
    }
}
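On the question of reusing the same thread for every call: that is possible in standard C++ by keeping one long-lived thread that sleeps on a condition variable until it is handed a job. The sketch below is one possible shape for it; PersistentWorker, submit and wait are made-up names, and it handles only one job in flight at a time.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper: a single thread created once and reused for every job.
class PersistentWorker {
public:
    PersistentWorker() : thread_([this] { run(); }) {}

    ~PersistentWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        wake_.notify_one();
        thread_.join();
    }

    // Hand a job to the worker; call wait() before submitting the next one.
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!busy_ && "previous job has not been waited for");
            job_ = std::move(job);
            busy_ = true;
        }
        wake_.notify_one();
    }

    // Block (without spinning) until the submitted job has finished.
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        done_.wait(lock, [this] { return !busy_; });
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [this] { return stop_ || busy_; });
            if (stop_) { return; }  // a job still pending at destruction is dropped
            std::function<void()> job = std::move(job_);
            lock.unlock();          // run the job without holding the lock
            job();
            lock.lock();
            busy_ = false;
            done_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable done_;
    std::function<void()> job_;
    bool busy_ = false;
    bool stop_ = false;
    std::thread thread_;  // declared last so the members above exist when it starts
};
```

In processAudio, the right track would be passed to submit() as a lambda, the left track processed on the calling thread, and wait() called before mixing. For work that takes only a microsecond or two, even waking a sleeping thread may cost more than it saves, so this needs to be measured.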