Designing a multithreaded application that scales well - C++

The code below is a demonstration of what I'm trying to do and it has the same problem as my original code (which is not included here). I have spectrogram code and I'm trying to improve its performance by using multiple threads (my computer has 4 cores). The spectrogram code basically computes an FFT over many overlapping frames (these frames correspond to sound samples at a particular time).
As an example let's say that we have 1000 frames which overlap by 50%.
If we're using 4 threads, each thread should handle 250 frames. Overlapping frames just means that if our frames are 1024 samples long, the first frame covers samples 0-1023, the second 512-1535, the third 1024-2047, etc. (an overlap of 512 samples).
The code creating and using the threads:
void __fastcall TForm1::Button1Click(TObject *Sender)
{
    numThreads = 4;
    fftLen = 1024;
    numWindows = 10000;
    int startTime = GetTickCount();
    numOverlappingWindows = numWindows*2;
    overlap = fftLen/2;
    const unsigned numElem = fftLen*numWindows + overlap;
    rx = new float[numElem];
    for(unsigned i = 0; i < numElem; i++){
        rx[i] = rand();
    }
    useThreads = true;
    vWThread.reserve(numThreads);
    if(useThreads){
        for(int i = 0; i < numThreads; i++){
            TWorkerThread *pWorkerThread = new TWorkerThread(true);
            pWorkerThread->SetWorkerMethodCallback(&CalculateWindowFFTs); // this is called in TWorkerThread::Execute
            vWThread.push_back(pWorkerThread);
        }
        pLock = new TCriticalSection();
        for(int i = 0; i < numThreads; i++){ // start the threads
            vWThread.at(i)->Resume();
        }
        while(TWorkerThread::GetNumThreads() > 0); // busy-wait until all workers finish
    }else{
        CalculateWindowFFTs();
    }
    int endTime = GetTickCount();
    Label1->Caption = IntToStr(endTime - startTime);
}
void TForm1::CalculateWindowFFTs(){
    unsigned startWnd = 0, endWnd = numOverlappingWindows, threadId;
    if(useThreads){
        threadId = TWorkerThread::GetCurrentThreadId();
        unsigned wndPerThread = numOverlappingWindows/numThreads;
        startWnd = (threadId-1)*wndPerThread;
        endWnd = threadId*wndPerThread;
        if(numThreads == threadId){
            endWnd = numOverlappingWindows; // last thread takes the remainder
        }
    }
    float *pReal, *pImg;
    for(unsigned i = startWnd; i < endWnd; i++){
        pReal = new float[fftLen];
        pImg = new float[fftLen];
        memcpy(pReal, &rx[i*overlap], fftLen*sizeof(float));
        memset(pImg, 0, fftLen*sizeof(float)); // zero the imaginary part
        FFT(pReal, pImg, fftLen); // perform an in-place FFT
        pLock->Acquire();
        vWndFFT.push_back(pReal);
        vWndFFT.push_back(pImg);
        pLock->Release();
    }
}
void TForm1::FFT(float *rx, float *ix, int fftSize)
{
    int i, j, k, m;
    float rxt, ixt;
    m = (int)(log((double)fftSize)/log(2.0) + 0.5); // number of butterfly stages, rounded to avoid float truncation
    int fftSizeHalf = fftSize/2;
    j = k = fftSizeHalf;
    // bit-reversal sorting
    for (i = 1; i < (fftSize-1); i++){
        if (i < j) {
            rxt = rx[j];
            ixt = ix[j];
            rx[j] = rx[i];
            ix[j] = ix[i];
            rx[i] = rxt;
            ix[i] = ixt;
        }
        k = fftSizeHalf;
        while (k <= j){
            j = j - k;
            k = k/2;
        }
        j = j + k;
    } //end for
    // butterfly computations, one stage at a time
    int le, le2, l, ip;
    float sr, si, ur, ui;
    for (k = 1; k <= m; k++) {
        le = 1 << k; // 2^k, computed in integer arithmetic
        le2 = le/2;
        ur = 1;
        ui = 0;
        sr = cos(PI/le2);
        si = -sin(PI/le2);
        for (j = 1; j <= le2; j++) {
            l = j - 1;
            for (i = l; i < fftSize; i += le) {
                ip = i + le2;
                rxt = rx[ip] * ur - ix[ip] * ui;
                ixt = rx[ip] * ui + ix[ip] * ur;
                rx[ip] = rx[i] - rxt;
                ix[ip] = ix[i] - ixt;
                rx[i] = rx[i] + rxt;
                ix[i] = ix[i] + ixt;
            } //end for
            rxt = ur;
            ur = rxt * sr - ui * si;
            ui = rxt * si + ui * sr;
        }
    }
}
While it's easy to divide this process over multiple threads, the performance is only marginally improved compared to the single-threaded version (<10%).
Interestingly, if I increase the number of threads to, say, 100, I do get a speed increase of about 25%, which is surprising because I'd expect thread context-switching overhead to be a factor in this case.
At first I thought the main reason for the poor performance was the lock on writes to the vector object, so I experimented with an array of vectors (one vector per thread), eliminating the need for the locks, but the performance remained pretty much the same.
pVfft = new vector<float*>[numThreads]; // create an array of vectors
// and then in CalculateWindowFFTs, do something like
vector<float*> &vThr = pVfft[threadId-1];
for(unsigned i = startWnd; i < endWnd; i++){
    pReal = new float[fftLen];
    pImg = new float[fftLen];
    memcpy(pReal, &rx[i*overlap], fftLen*sizeof(float));
    memset(pImg, 0, fftLen*sizeof(float));
    FFT(pReal, pImg, fftLen); // perform an in-place FFT
    vThr.push_back(pReal);
}
I think I'm running into caching problems here, though I'm not certain how to change my design to get a solution that scales well.
I can also provide the code for TWorkerThread if you think that's important.
Any help is much appreciated.
Thanks
UPDATE:
As suggested by 1201ProgramAlarm, I removed that while loop and got about a 15-20% speed improvement on my system. Now my main thread is not actively waiting for the threads to finish; instead, TWorkerThread executes code on the main thread via TThread::Synchronize after all the worker threads have finished (i.e. when the count of running threads reaches 0).
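In sketch form (TWorkerThread's internals aren't shown in this post, so the names below are illustrative):
void __fastcall TWorkerThread::Execute()
{
    CallWorkerMethod();                          // runs CalculateWindowFFTs
    if (InterlockedDecrement(&FNumThreads) == 0) // FNumThreads: counter shared by the workers
        Synchronize(&NotifyAllDone);             // NotifyAllDone runs on the main (VCL) thread
}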
While this is looking better now, it's still far from being optimal.

The locks to write to vWndFFT will hurt, as will the repeated (and leaking) calls to new for pReal and pImg (these allocations should be hoisted out of the for loop).
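A minimal sketch of both fixes, assuming vWndFFT can be pre-sized once by the main thread so each worker writes only to its own disjoint slots (names as in the question):
// Done once on the main thread, before the workers start:
vWndFFT.resize(2 * numOverlappingWindows);   // [2*i] = real, [2*i+1] = imaginary

// Inside CalculateWindowFFTs: one allocation per thread instead of two per window.
float *block = new float[2 * (endWnd - startWnd) * fftLen];
for (unsigned i = startWnd; i < endWnd; i++) {
    float *pReal = block + 2 * (i - startWnd) * fftLen;
    float *pImg  = pReal + fftLen;
    memcpy(pReal, &rx[i*overlap], fftLen*sizeof(float));
    memset(pImg, 0, fftLen*sizeof(float));
    FFT(pReal, pImg, fftLen);
    vWndFFT[2*i]   = pReal;   // disjoint indices per thread: no lock needed
    vWndFFT[2*i+1] = pImg;
}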
But the real performance killer is probably your loop waiting for the threads to finish: while(TWorkerThread::GetNumThreads()>0);. This will consume one available core in a very unfriendly way.
One quick fix (not recommended) would be to add a sleep(1) (or 2, 5, or 10) so the loop is not continuous.
A better solution would be to have the main thread be one of your calculation threads, and have a way for that thread (once it is done with all its processing) to simply wait for the other threads to finish without consuming a core, using something like WaitForMultipleObjects, which is available on Windows.
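A rough sketch of that approach, assuming TWorkerThread exposes the underlying Win32 handle (the VCL's TThread::Handle property):
HANDLE handles[MAXIMUM_WAIT_OBJECTS];
int nWorkers = numThreads - 1;                 // main thread takes one share itself
for (int i = 0; i < nWorkers; i++)
    handles[i] = (HANDLE)vWThread.at(i)->Handle;
CalculateWindowFFTs();                         // main thread's share of the work
WaitForMultipleObjects(nWorkers, handles, TRUE, INFINITE); // block without spinning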
One simple way to test your threaded code is to run it threaded but with only one thread: performance should be about the same as the non-threaded version, and the results should match.

Related

Optimizing the Vivado HLS code to reduce the latency for image processing algorithm

I am trying to implement an image processing algorithm for a gamut mapping filter in hardware using Vivado HLS. I have created a synthesizable version from Halide code, but it is taking way too long: for a 256x512 image it takes around 135 seconds, which shouldn't be the case. I have used some optimization techniques, like pipelining the innermost loop; I set a target initiation interval of II=1 for the innermost loop, but the achieved II is 6. From the warnings thrown by the compiler, I understand that this is because of the accesses to the weights (ctrl_pts & weights). From the tutorials I have seen, array partitioning and array reshaping would help with faster access to the weights. I have shared the code I used for synthesis below:
//header
#include "hls_stream.h"
#include <ap_fixed.h>
//#include <ap_int.h>
#include "ap_int.h"
typedef ap_ufixed<24,24> bit_24;
typedef ap_fixed<11,8> fix;
typedef unsigned char uc;
typedef ap_uint<24> stream_width;
//typedef hls::stream<uc> Stream_t;
typedef hls::stream<stream_width> Stream_t;
struct pixel_f
{
    float r;
    float g;
    float b;
};
struct pixel_8
{
    uc r;
    uc g;
    uc b;
};
void gamut_transform(int rows, int cols, Stream_t& in, Stream_t& out, float ctrl_pts[3702][3], float weights[3702][3], float coefs[4][3], float num_ctrl_pts);
//core
//include the header
#include "gamut_header.h"
#include "hls_math.h"
void gamut_transform(int rows, int cols, Stream_t& in, Stream_t& out, float ctrl_pts[3702][3], float weights[3702][3], float coefs[4][3], float num_ctrl_pts)
{
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
//#pragma HLS INTERFACE fifo port=out
#pragma HLS dataflow
    pixel_8 input;
    pixel_8 new_pix;
    bit_24 temp_in, temp_out;
    pixel_f buff_1, buff_2, buff_3, buff_4, buff_5;
    float dist;
    for (int i = 0; i < 256; i++)
    {
        for (int j = 0; i < 512; i++)
        {
            temp_in = in.read();
            input.r = (temp_in & 0xFF0000) >> 16;
            input.g = (temp_in & 0x00FF00) >> 8;
            input.b = (temp_in & 0x0000FF);
            buff_1.r = ((float)input.r)/256.0;
            buff_1.g = ((float)input.g)/256.0;
            buff_1.b = ((float)input.b)/256.0;
            for (int idx = 0; idx < 3702; idx++)
            {
                buff_2.r = buff_1.r - ctrl_pts[idx][0];
                buff_2.g = buff_1.g - ctrl_pts[idx][1];
                buff_2.b = buff_1.b - ctrl_pts[idx][2];
                dist = sqrt((buff_2.r*buff_2.r)+(buff_2.g*buff_2.g)+(buff_2.b*buff_2.b));
                buff_3.r = buff_2.r + (weights[idx][0] * dist);
                buff_3.g = buff_2.g + (weights[idx][1] * dist);
                buff_3.b = buff_2.b + (weights[idx][2] * dist);
            }
            buff_4.r = buff_3.r + coefs[0][0] + buff_1.r * coefs[1][0] + buff_1.g * coefs[2][0] + buff_1.b * coefs[3][0];
            buff_4.g = buff_3.g + coefs[0][1] + buff_1.r * coefs[1][1] + buff_1.g * coefs[2][1] + buff_1.b * coefs[3][1];
            buff_4.b = buff_3.b + coefs[0][2] + buff_1.r * coefs[1][2] + buff_1.g * coefs[2][2] + buff_1.b * coefs[3][2];
            buff_5.r = fmin(fmax((float)buff_4.r, 0.0), 255.0);
            buff_5.g = fmin(fmax((float)buff_4.g, 0.0), 255.0);
            buff_5.b = fmin(fmax((float)buff_4.b, 0.0), 255.0);
            new_pix.r = (uc)buff_5.r; // cast the clamped values
            new_pix.g = (uc)buff_5.g;
            new_pix.b = (uc)buff_5.b;
            temp_out = ((uc)new_pix.r << 16 | (uc)new_pix.g << 8 | (uc)new_pix.b);
            out << temp_out;
        }
    }
}
Even with the achieved II=6, the time taken is around 6 seconds; the given target is for it to take milliseconds. I tried pipelining the second innermost loop, but I run out of resources on my board when I do that, because the loop nested inside it gets unrolled. I am using a Zynq UltraScale board, which has a fair amount of resources. Any suggestions on optimizing the code will be highly appreciated.
Also, can anyone suggest what type of interface would be best suited for ctrl_pts, weights and coefs? For reading the image I understood that a streaming interface helps, and for reading small values like the number of rows and columns, AXI-Lite is preferred. Is there a type of interface I can use for the mentioned variables so that it goes hand in hand with array partitioning and array reshaping?
Any suggestions will be highly appreciated,
Thanks in advance
Edit: I understand that a fixed-point representation can bring the latency down further, but my first goal is to get the best result with the floating-point representation and then analyze the performance with a fixed-point representation.
There are some steps you can take to optimize your design, but bear in mind that if you really need a floating-point square root operation, it will most likely have a huge latency penalty (unless properly pipelined, of course).
Your code might have a typo in the second inner loop: the index should be j, right?
Data Locality
First off: ctrl_pts is read multiple times from main memory (I assume). Since it's reused 256x512 times, it would be better to store it in a local buffer on the FPGA (such as a BRAM, which can be inferred), like so:
for (int i = 0; i < 3702; i++) {
    for (int j = 0; j < 3; ++j) {
#pragma HLS PIPELINE II=1
        ctrl_pts_local[i][j] = ctrl_pts[i][j];
    }
}
for (int i = 0; i < 256; i++) {
    for (int j = 0; j < 512; j++) {
        // ...
        buff_2.r = buff_1.r - ctrl_pts_local[idx][0];
        // ...
The same reasoning goes for coefs and weights: just store them in a local variable before running the rest of the code.
To access the arguments you can use a master AXI4 interface (m_axi) and configure it accordingly. Once the algorithm is dealing with the local buffers, HLS should be able to partition them automatically. If not, you can place ARRAY_PARTITION pragmas (e.g. complete dim=0) to force it.
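A possible sketch of such a setup (untested; the bundle names, the rows/cols ports, and the partitioning choices are assumptions):
#pragma HLS INTERFACE m_axi     port=ctrl_pts offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi     port=weights  offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi     port=coefs    offset=slave bundle=gmem1
#pragma HLS INTERFACE s_axilite port=rows     bundle=control
#pragma HLS INTERFACE s_axilite port=cols     bundle=control
#pragma HLS INTERFACE s_axilite port=return   bundle=control
// If HLS doesn't partition the local buffers on its own:
#pragma HLS ARRAY_PARTITION variable=coefs_local complete dim=0
#pragma HLS ARRAY_PARTITION variable=ctrl_pts_local complete dim=2
#pragma HLS ARRAY_PARTITION variable=weights_local complete dim=2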
Dataflow
Because of the way your algorithm works, another thing you could try is to break down the main loops (256x512) into three smaller processes running in dataflow, and so in parallel (+3 if you include the setup ones)
The whole code will look something like this (I hope it renders correctly):
[compute buff_1] --> [FIFO1] --> [compute buff_3] --> [FIFO2a] --> [compute buff_4 and buff_5 + stream out]
        |                                                               ^
        +--------------------------------> [FIFO2b] -------------------+
One tricky thing would be to stream buff_1 to both the next processes.
Possible Code
I haven't tried this code, so there might be compilation errors along the way, but the whole accelerator code would look something like this:
for (int i = 0; i < 3702; i++) {
    for (int j = 0; j < 3; ++j) {
#pragma HLS PIPELINE II=1
        ctrl_pts_local[i][j] = ctrl_pts[i][j];
        weights_local[i][j] = weights[i][j];
    }
}
for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 3; ++j) {
#pragma HLS PIPELINE II=1
        coefs_local[i][j] = coefs[i][j];
    }
}
Process_1:
for (int i = 0; i < 256; i++) {
    for (int j = 0; j < 512; j++) {
#pragma HLS PIPELINE II=1
        temp_in = in.read();
        input.r = (temp_in & 0xFF0000) >> 16;
        input.g = (temp_in & 0x00FF00) >> 8;
        input.b = (temp_in & 0x0000FF);
        buff_1.r = ((float)input.r)/256.0;
        buff_1.g = ((float)input.g)/256.0;
        buff_1.b = ((float)input.b)/256.0;
        fifo_1.write(buff_1); // <--- WRITE TO FIFOs
        fifo_2b.write(buff_1);
    }
}
Process_2:
for (int i = 0; i < 256; i++) {
    for (int j = 0; j < 512; j++) {
        for (int idx = 0; idx < 3702; idx++) {
#pragma HLS LOOP_FLATTEN // <-- It shouldn't be necessary, since the if statements already help
#pragma HLS PIPELINE II=1 // <-- The PIPELINE directive can go here
            if (idx == 0) {
                buff_1 = fifo_1.read(); // <--- READ FROM FIFO
            }
            buff_2.r = buff_1.r - ctrl_pts_local[idx][0];
            buff_2.g = buff_1.g - ctrl_pts_local[idx][1];
            buff_2.b = buff_1.b - ctrl_pts_local[idx][2];
            dist = sqrt((buff_2.r*buff_2.r)+(buff_2.g*buff_2.g)+(buff_2.b*buff_2.b));
            buff_3.r = buff_2.r + (weights_local[idx][0] * dist);
            buff_3.g = buff_2.g + (weights_local[idx][1] * dist);
            buff_3.b = buff_2.b + (weights_local[idx][2] * dist);
            if (idx == 3702 - 1) {
                fifo_2a.write(buff_3); // <-- WRITE TO FIFO
            }
        }
    }
}
Process_3:
for (int i = 0; i < 256; i++) {
    for (int j = 0; j < 512; j++) {
#pragma HLS PIPELINE II=1
        buff_3 = fifo_2a.read(); // <--- READ FROM FIFO
        buff_1 = fifo_2b.read(); // <--- READ FROM FIFO
        buff_4.r = buff_3.r + coefs_local[0][0] + buff_1.r * coefs_local[1][0] + buff_1.g * coefs_local[2][0] + buff_1.b * coefs_local[3][0];
        buff_4.g = buff_3.g + coefs_local[0][1] + buff_1.r * coefs_local[1][1] + buff_1.g * coefs_local[2][1] + buff_1.b * coefs_local[3][1];
        buff_4.b = buff_3.b + coefs_local[0][2] + buff_1.r * coefs_local[1][2] + buff_1.g * coefs_local[2][2] + buff_1.b * coefs_local[3][2];
        buff_5.r = fmin(fmax((float)buff_4.r, 0.0), 255.0);
        buff_5.g = fmin(fmax((float)buff_4.g, 0.0), 255.0);
        buff_5.b = fmin(fmax((float)buff_4.b, 0.0), 255.0);
        new_pix.r = (uc)buff_5.r; // cast the clamped values
        new_pix.g = (uc)buff_5.g;
        new_pix.b = (uc)buff_5.b;
        temp_out = ((uc)new_pix.r << 16 | (uc)new_pix.g << 8 | (uc)new_pix.b);
        out << temp_out;
    }
}
Be extremely careful in sizing the depth of the FIFOs, since Process 2 (the one with the sqrt operation) has slower data consumption and production rates! Also, FIFO 2b needs to account for that delay. If the rates don't match, there will be deadlocks. Make sure to have a meaningful testbench and to cosimulate your design.
(The FIFO depth can be changed with the pragma #pragma HLS STREAM variable=fifo_1 depth=N.)
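Putting it together, the stream declarations for the dataflow region might look like this (a sketch; the depths are illustrative and must be sized against your actual latencies, especially for fifo_2b):
static hls::stream<pixel_f> fifo_1("fifo_1");
static hls::stream<pixel_f> fifo_2a("fifo_2a");
static hls::stream<pixel_f> fifo_2b("fifo_2b");
#pragma HLS STREAM variable=fifo_1  depth=8
#pragma HLS STREAM variable=fifo_2a depth=8
#pragma HLS STREAM variable=fifo_2b depth=4096 // must cover Process_2's ~3702 cycles per pixel
#pragma HLS DATAFLOW
// Process_1, Process_2 and Process_3 from above run here, in this order.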
Final Thoughts
There might be further smaller/detailed optimizations that can be performed along the way, but I would start with the ones above, as they are the heaviest. Just bear in mind that floating-point processing is not optimal on FPGAs (as you noted) and is usually avoided.
EDIT: I tried the code with the modifications above and achieved II=1 with decent resource usage.
Since the II is now 1, the ideal number of cycles the accelerator would take is 256x512, and I'm close to that (ideal 402,653,184 versus mine 485,228,587). One crazy idea I now have to propose: split the Process_2 innermost loop into two parallel branches (or even more than 2), each feeding its own FIFO. Process_1 would supply the two branches, while an additional process/loop alternately reads the 256x512 elements from the two FIFOs and supplies them in the correct order to Process_3. In this way, the total number of cycles required should halve, since Process_2 is the slowest process in the dataflow (and so improving it improves the whole design). One possible drawback of this approach is a higher amount of area/resources required on the FPGA.
Good luck.

Multi-threading molecular simulations in C++

I am developing a molecular dynamics simulation code in C++, which essentially takes atom positions and other properties as input and simulates their motion under Newton's laws of motion. The core algorithm uses what's called the Velocity Verlet scheme and looks like:
// iterate through time (k=[1,#steps])
double Dt = 0.002;   // time step
double Ttot = 1.0;   // total time
double halfDt = Dt/2.0;
for (int k = 1; k*Dt <= Ttot; k++){
    for (int i = 0; i < number_particles; i++)
        vHalf[i] = p[i].velocity + F[i]*halfDt;  // step 1
    for (int i = 0; i < number_particles; i++)
        p[i].position += vHalf[i]*Dt;            // step 2
    for (int i = 0; i < number_particles; i++)
        F[i] = Force(p,i);                       // recalculate force on all particle i's
    for (int i = 0; i < number_particles; i++)
        p[i].velocity = vHalf[i] + F[i]*halfDt;  // step 3
}
Where p is an array of class objects which store things like particle position, velocity, mass, etc., and Force is a function that calculates the net force on a particle using something like the Lennard-Jones potential.
My question regards the time required to complete the calculation; all of my subroutines are optimized in terms of number crunching (e.g. using x*x*x to raise to the third power instead of pow(x,3)), but the main issue is that the time loop is often performed for millions of iterations, and there are typically close to a million particles. Is there any way to implement this algorithm using multithreading? From my understanding, multithreading essentially opens another stream of data to and from a CPU core, which would allow me to run two different simulations at the same time; I would like to use multithreading to make just one of these simulations run faster.
I'd recommend using OpenMP.
Your specific use case is trivially parallelizable.
Parallelization should be as simple as:
double Dt = 0.002;   // time step
double Ttot = 1.0;   // total time
double halfDt = Dt/2.0;
for (int k = 1; k*Dt <= Ttot; k++){
    #pragma omp parallel for
    for (int i = 0; i < number_particles; i++){
        vHalf[i] = p[i].velocity + F[i]*halfDt;  // step 1
        p[i].position += vHalf[i]*Dt;            // step 2
    }
    #pragma omp parallel for
    for (int i = 0; i < number_particles; i++){
        F[i] = Force(p,i);                       // recalculate force on particle i
        p[i].velocity = vHalf[i] + F[i]*halfDt;  // step 3
    }
}
Most popular compilers and platforms have support for OpenMP.
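If the force evaluation dominates (it usually does, being roughly O(N²) for a naive Lennard-Jones sum) and its cost varies per particle, a dynamic schedule may help. This is only a sketch, and it assumes Force reads positions but writes nothing shared:
#pragma omp parallel for schedule(dynamic, 1024)
for (int i = 0; i < number_particles; i++)
    F[i] = Force(p, i);   // each iteration writes only F[i], so there is no race
// Build with OpenMP enabled, e.g. g++ -O2 -fopenmp sim.cpp (or cl /openmp with MSVC).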

Function started with std::async crashes after quite a few iterations

I am trying to develop a simple evolution algorithm in C++. To make my calculations faster I decided to use async functions to run multiple calculations at once:
std::vector<std::future<int> > compute(8);
unsigned nptr = 0;
int syncp = 0;
while(nptr != network::networks.size()){
    compute.at(syncp) = std::async(&network::analyse, &network::networks.at(nptr), data, width, height, sw, dFnum.at(idx));
    syncp++;
    if(syncp == 8){
        syncp = 0;
        for(unsigned i = 0; i < 8; i++){
            compute.at(i).get();
        }
    }
    nptr++;
}
This is how I start my calculating function. The function is called analyse, and for each "network" it assigns a score depending on how well it identifies the image.
This is part of the analyse function:
for(unsigned i = 0; i < entry.size(); i++){
    double sum = 0;
    data * d = &entry.at(i);
    pattern * p = &pattern::patterns.at(d->patNo);
    int sx = iWidth;
    int sy = iHeight;
    if(d->xPercentage*iWidth + d->xSpan*iWidth < sx) sx = d->xPercentage*iWidth + d->xSpan*iWidth;
    if(d->yPercentage*iHeight + d->xSpan*iWidth < sy) sy = d->yPercentage*iHeight + d->xSpan*iWidth;
    int xdisp = sx - d->xPercentage*iWidth;
    int ydisp = sy - d->yPercentage*iHeight;
    for(int x = d->xPercentage*iWidth; x < sx; x++){
        for(int y = d->yPercentage*iHeight; y < sy; y++){
            double xpl = x - d->xPercentage*iWidth;
            double ypl = y - d->yPercentage*iHeight;
            xpl /= xdisp;
            ypl /= ydisp;
            unsigned idx = (unsigned)(xpl*(p->width) + ypl*(p->height)*(p->width));
            if(idx >= p->lweight.size()) idx = p->lweight.size()-1;
            double weight = p->lweight.at(idx) - 5;
            if(imageData[y*iWidth+x])
                sum += weight;
            else
                sum -= 2*weight;
        }
    }
    digitWeight[d->digit-1] += sum;
}
Now, there is no need to analyse the function itself - I'm sure it works; I have tested it on a single thread, and it runs just fine. The only problem is that after some time of execution I get errors like segmentation faults or vector range-check errors.
They mostly happen at this line:
digitWeight[d->digit-1] += sum;
Now, you can be sure that d->digit-1 is a valid index for this array.
The problem is that the value of the d pointer is different from what it was here:
data * d = &entry.at(i);
It magically changes during the execution of the function and starts pointing to different data, leading to errors. I have tried saving the value of d->digit to a variable and using that variable instead; it worked fine for just a while longer before crashing on another shared resource, imageData this time.
I'm thinking this might be related to data sharing - all the async functions share the same array of data - it's a static vector. But this data is only read, never written, so why would it stop working? I know of something called mutex locking, but it would make no sense to lock these async functions, as the program would then run as slowly as a single-threaded one.
I have also tried running the functions like this:
std::vector<std::thread*> threads(8);
unsigned nptr = 0;
int threadp = 0;
while(nptr != network::networks.size()){
    threads.at(threadp) = new std::thread(&network::analyse, &network::networks.at(nptr), data, width, height, sw, dFnum.at(idx));
    threadp++;
    if(threadp == 8){
        threadp = 0;
        for(unsigned i = 0; i < 8; i++){
            if(threads.at(i)->joinable()) threads.at(i)->join();
            delete threads.at(i);
        }
    }
    nptr++;
}
and it did work for a second, but after some time a very similar error appeared.
Data is a structure containing 7 integers, one of which is the ID of a pattern; pattern is a class that contains two integers (width and height) and a vector of chars.
Why does it happen on read-only data and how can I prevent it?

Qimage setPixel with openmp parallel for doesn't work

The code works without parallelism, but when I add #pragma omp parallel for, it doesn't work. Furthermore, the code works perfectly with the pragma if I don't call setPixel. So I would like to know why the parallelism doesn't work properly and the program exits with code 255 when I try to set a pixel in the new image. This code changes an image by looping over every pixel and applying a Gauss vector. If something is unclear I'll explain it immediately.
for (h = 0; h < height; h++){
    QRgb* row = (QRgb*) result->scanLine(h);
    //#pragma omp parallel for schedule(dynamic) num_threads(cores) private (j, auxazul, auxrojo, auxverde) reduction(+:red,green,blue)
    for (w = 0; w < width; w++) {
        red = green = blue = 0;
        minj = max((M-w), 0);
        supj = min((width+M-w), N);
        for (j = minj; j < supj; j++){
            auxazul = azul [w-M+j][h];
            auxrojo = rojo [w-M+j][h];
            auxverde = verde [w-M+j][h];
            red += vectorGauss[j]*auxrojo;
            green += vectorGauss[j]*auxverde;
            blue += vectorGauss[j]*auxazul;
        }
        red /= 256; green /= 256; blue /= 256;
        //result->setPixel(w,h,QColor(red,green,blue).rgba());
        row[w] = QColor(red,green,blue).rgba();
    }
}
QImage::setPixel is not thread safe, since it calls the detach() method (have a look at the official documentation here). Remember that QImage uses implicit sharing.
Besides, setPixel() is extremely slow. If you are seeking performance (as one usually is when dealing with parallel implementations), that's not the best way to go.
Using scanLine() as you already do in the example provided is the correct way of doing it.
Besides the comments that setPixel is slow and not thread safe, you currently have a race condition when writing the result:
row[w] = QColor(red,green,blue).rgba();
Your code is slow in the first place because you are accessing your color matrices in a memory-inefficient way. Throwing threads at it will make this part worse. Given that you loop over each scanline, you want the transpose of your color matrices, which allows you to do:
for (h = 0; h < height; h++){
    QRgb* row = (QRgb*) result->scanLine(h);
    auto azulscan = azul [h];
    auto rojoscan = rojo [h];
    auto verdescan = verde [h];
    for (w = 0; w < width; w++) {
        red = green = blue = 0;
        minj = max((M-w), 0);
        supj = min((width+M-w), N);
        for (j = minj; j < supj; j++){
            auto auxazul = azulscan [w-M+j];
            auto auxrojo = rojoscan [w-M+j];
            auto auxverde = verdescan [w-M+j];
            red += vectorGauss[j]*auxrojo;
            green += vectorGauss[j]*auxverde;
            blue += vectorGauss[j]*auxazul;
        }
        row[w] = QColor(red,green,blue).rgba();
    }
}
I don't know OpenMP well, but you want a single thread per scanline, so your parallel loop needs to be the outermost one. Something like:
#pragma omp parallel for whatever
for (h = 0; h < height; h++){
    QRgb* row;
    #pragma omp critical
    {
        row = (QRgb*) result->scanLine(h);
    }
    // ...
}
Another point: you can use std::inner_product to compute a color value in a single line once you have the transpose of the color inputs.
green = std::inner_product(&vectorGauss[minj], &vectorGauss[supj], &verdescan[w-M+minj], 0.0);

Parallelism vs Threading - Performance

I have been reading on the subject, but I haven't been able to find a concrete answer to my question. I am interested in using parallelism/multithreading to improve the performance of my game, but I have heard some contradicting facts. For example, that multithreading may not produce any improvement on the execution speed of a game.
I have thought of two ways to do this:
putting the rendering component into a thread; there are some things I would need to change, but I have a good idea of what needs to be done.
using OpenMP to parallelize the rendering function; I already have code to do so, so this might be the easier option.
This being a Uni assessment, the target hardware is my Uni's computers, which are multi-core (4 cores), and therefore I am hoping to achieve some additional efficiency using either one of these techniques.
My question is therefore the following: which one should I prefer? Which normally produces the best results?
EDIT: The main function I mean to parallelize/multithread away:
void Visualization::ClipTransBlit ( int id, Vector2i spritePosition, FrameData frame, View *view )
{
    const Rectangle viewRect = view->GetRect ();
    BYTE *bufferPtr = view->GetBuffer ();
    Texture *txt = txtMan_.GetTexture ( id );
    Rectangle clippingRect = Rectangle ( 0, frame.frameSize.x, 0, frame.frameSize.y );
    clippingRect.Translate ( spritePosition );
    clippingRect.ClipTo ( viewRect );
    Vector2i negPos ( -spritePosition.x, -spritePosition.y );
    clippingRect.Translate ( negPos );
    if ( spritePosition.x < viewRect.left_ ) { spritePosition.x = viewRect.left_; }
    if ( spritePosition.y < viewRect.top_ ) { spritePosition.y = viewRect.top_; }
    if ( clippingRect.GetArea() == 0 ) { return; }
    //clippingRect.Translate ( frameData );
    BYTE *destPtr = bufferPtr + ((abs(spritePosition.x) - abs(viewRect.left_)) + (abs(spritePosition.y) - abs(viewRect.top_)) * viewRect.Width()) * 4; // corner position of the sprite (top left corner)
    BYTE *tempSPtr = txt->GetData() + (clippingRect.left_ + clippingRect.top_ * txt->GetSize().x) * 4;
    int w = clippingRect.Width();
    int h = clippingRect.Height();
    int endOfLine = (viewRect.Width() - w) * 4;
    int endOfSourceLine = (txt->GetSize().x - w) * 4;
    for (int i = 0; i < h; i++)
    {
        for (int j = 0; j < w; j++)
        {
            if (tempSPtr[3] != 0)
            {
                memcpy(destPtr, tempSPtr, 4);
            }
            destPtr += 4;
            tempSPtr += 4;
        }
        destPtr += endOfLine;
        tempSPtr += endOfSourceLine;
    }
}
Instead of calling memcpy for each pixel, consider just setting the value there; the overhead of calling a function that many times could dominate the overall execution time of this loop. E.g.:
for (int i = 0; i < h; i++)
{
    for (int j = 0; j < w; j++)
    {
        if (tempSPtr[3] != 0)
        {
            *((DWORD*)destPtr) = *((DWORD*)tempSPtr);
        }
        destPtr += 4;
        tempSPtr += 4;
    }
    destPtr += endOfLine;
    tempSPtr += endOfSourceLine;
}
You could also avoid the conditional by employing one of the tricks mentioned here: avoiding conditionals - in such a tight loop, conditionals can be very expensive.
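For example, one branchless variant (a sketch, assuming 32-bit pixels; uint32_t comes from <cstdint>):
uint32_t src  = *reinterpret_cast<const uint32_t*>(tempSPtr);
uint32_t dst  = *reinterpret_cast<uint32_t*>(destPtr);
// All ones if the alpha byte is non-zero, all zeros otherwise:
uint32_t mask = static_cast<uint32_t>(-static_cast<int32_t>(tempSPtr[3] != 0));
*reinterpret_cast<uint32_t*>(destPtr) = (src & mask) | (dst & ~mask);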
Edit:
As to whether it's better to run several instances of ClipTransBlit concurrently or to parallelize ClipTransBlit internally, I would say that, generally speaking, it's better to implement parallelization at as high a level as possible, to reduce the overhead you incur by setting it up (creating threads, synchronizing them, etc.).
In your case though, because it looks like you're drawing sprites, if they were to overlap then without additional synchronization your high-level threading might lead to nasty visual artifacts and even a race condition on checking the alpha bit. In that case the low-level parallelism might be the better choice.
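A sketch of that low-level option, parallelizing the row loop of ClipTransBlit (the strides are derived from the variables already computed above):
#pragma omp parallel for
for (int i = 0; i < h; i++)                    // one thread per sprite row
{
    BYTE *d = destPtr + i * (w * 4 + endOfLine);              // row i in the view buffer
    const BYTE *s = tempSPtr + i * (w * 4 + endOfSourceLine); // row i in the texture
    for (int j = 0; j < w; j++, d += 4, s += 4)
        if (s[3] != 0)
            *((DWORD*)d) = *((const DWORD*)s);
}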
Theoretically, they should produce the same effect. In practice, it might be quite different.
If you print out the assembly code of an OpenMP program, you'll see that OpenMP simply calls a function generated for the scope of the #pragma omp parallel block. It is similar to fork.
OpenMP is oriented toward parallel computing; multithreading, on the other hand, is more general.
For example, if you want to write a GUI program, multithreading is necessary (some frameworks may hide it, but multiple threads are still needed). However, you would never implement it with OpenMP.