OpenMP: parallel for doesn't do anything - c++

I'm trying to make a parallel version of SIFT algorithm in OpenCV.
In particular in sift.cpp:
static void calcDescriptors(const std::vector<Mat>& gpyr, const std::vector<KeyPoint>& keypoints,
                            Mat& descriptors, int nOctaveLayers, int firstOctave )
{
    ...
    #pragma omp parallel for
    for( size_t i = 0; i < keypoints.size(); i++ )
    {
        ...
        calcSIFTDescriptor(img, ptf, angle, size*0.5f, d, n, descriptors.ptr<float>((int)i));
        ...
    }
This already gives a speed-up from 84ms to 52ms on a quad-core machine. It doesn't scale that well, but it's a good result for adding one line of code.
Most of the computation inside the loop is performed by calcSIFTDescriptor(), which takes on average only 100us per call. So most of the computation time comes from the very high number of times calcSIFTDescriptor() is called (thousands of times), and accumulating all these 100us calls adds up to several ms.
So I'm trying to optimize calcSIFTDescriptor() itself. Its code is divided between two for loops, and the following one takes on average 60us:
for( k = 0; k < len; k++ )
{
    float rbin = RBin[k], cbin = CBin[k];
    float obin = (Ori[k] - ori)*bins_per_rad;
    float mag = Mag[k]*W[k];
    int r0 = cvFloor( rbin );
    int c0 = cvFloor( cbin );
    int o0 = cvFloor( obin );
    rbin -= r0;
    cbin -= c0;
    obin -= o0;
    if( o0 < 0 )
        o0 += n;
    if( o0 >= n )
        o0 -= n;
    // histogram update using tri-linear interpolation
    float v_r1 = mag*rbin, v_r0 = mag - v_r1;
    float v_rc11 = v_r1*cbin, v_rc10 = v_r1 - v_rc11;
    float v_rc01 = v_r0*cbin, v_rc00 = v_r0 - v_rc01;
    float v_rco111 = v_rc11*obin, v_rco110 = v_rc11 - v_rco111;
    float v_rco101 = v_rc10*obin, v_rco100 = v_rc10 - v_rco101;
    float v_rco011 = v_rc01*obin, v_rco010 = v_rc01 - v_rco011;
    float v_rco001 = v_rc00*obin, v_rco000 = v_rc00 - v_rco001;
    int idx = ((r0+1)*(d+2) + c0+1)*(n+2) + o0;
    hist[idx] += v_rco000;
    hist[idx+1] += v_rco001;
    hist[idx+(n+2)] += v_rco010;
    hist[idx+(n+3)] += v_rco011;
    hist[idx+(d+2)*(n+2)] += v_rco100;
    hist[idx+(d+2)*(n+2)+1] += v_rco101;
    hist[idx+(d+3)*(n+2)] += v_rco110;
    hist[idx+(d+3)*(n+2)+1] += v_rco111;
}
So I tried to add #pragma omp parallel for private(k) before this loop, and something weird happens: nothing at all!
Introducing this parallel for makes the computation take 53ms on average (against 52ms before). I would have expected one or more of the following results:
Taking >52ms because of the overhead of a new parallel for
Taking <52ms because of the gain obtained from the parallel for
Some sort of inconsistency in the result, since as you can see the shared array hist is updated concurrently. None of this happens: the result is still correct and no atomic or critical is used.
I'm an OpenMP newbie, but from what I see it's as if this inner parallel for is simply ignored. Why does this happen?
NOTE: all the reported times are the average time with the same input for 10.000 times.
UPDATE:
I tried removing the first parallel for, leaving only the one in calcSIFTDescriptor, and what I was expecting happened: inconsistent results appeared due to the lack of any thread-safety mechanism. Introducing #pragma omp critical(dataupdate) before updating hist restored consistency, but now performance is horrible: 245ms on average.
I think this is because of the overhead of the parallel for in calcSIFTDescriptor, which is not worth it for parallelizing ~30us of work.
BUT THE QUESTION STILL REMAINS: why did the first version (with two parallel fors) not produce any change, in either performance or consistency?

I found the answer myself: the second (nested) parallel for has no effect, for the reason described here:
OpenMP parallel regions can be nested inside each other. If nested
parallelism is disabled, then the new team created by a thread
encountering a parallel construct inside a parallel region consists
only of the encountering thread. If nested parallelism is enabled,
then the new team may consist of more than one thread.
So since the first parallel for takes all the available threads, the team of the second one consists only of the encountering thread itself, and nothing happens.
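A minimal standalone sketch (not the original SIFT code) that demonstrates this: with nested parallelism disabled (the default), the inner region reports a team size of 1 for every outer thread.
#include <cstdio>
#include <omp.h>

int main()
{
    #pragma omp parallel num_threads(4)   // outer region: team of 4 threads
    {
        int outer_id = omp_get_thread_num();

        #pragma omp parallel for          // inner region: team of 1 unless nesting is enabled
        for (int k = 0; k < 2; ++k)
        {
            #pragma omp critical
            std::printf("outer thread %d sees an inner team of %d thread(s)\n",
                        outer_id, omp_get_num_threads());
        }
    }
    return 0;
}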
Cheers to myself!

CUDA: Runge-Kutta trajectory on each GPU thread

Summary: How do you avoid performance loss caused by different work loads for different threads? (Kernel with a while loop on each thread)
Problem:
I want to solve particle trajectories (described by a 2nd order differential equation) using Runge-Kutta for many different initial conditions. The trajectories will generally have different lengths (each trajectory ends when a particle hits some target). Furthermore, to ensure numerical stability, the Runge-Kutta step size is set adaptively. This leads to two nested while loops, each with an unknown number of iterations (see the serial example below).
I want to implement the Runge-Kutta routine to run on a GPU with CUDA/C++. The trajectories have no dependency on each other, so as a first approach I will just parallelize over the different initial conditions, such that each thread corresponds to a unique trajectory. When a thread is done with a particle trajectory, I want it to start on a new one.
If I understand it correctly, however, the unknown length of each while loop (particle trajectory) means that different threads will get different amounts of work, which might lead to a severe performance loss on the GPU.
Question: Is it possible to overcome (in a simple way) the performance losses caused by different workloads for different threads? For example, setting each warp size to be only 1, such that each thread (warp) can run independently? Or will this lead to other performance losses (e.g. no coalesced memory reads)?
Serial pseudo-code:
// Solve a particle trajectory for each initial condition
// N_trajectories: much larger than 1e6
for( int t_i = 0; t_i < N_trajectories; ++t_i )
{
    // Set start coordinates
    double x  = x_init[t_i];
    double y  = y_init[t_i];
    double vx = vx_init[t_i];
    double vy = vy_init[t_i];
    double stepsize  = ...;
    double tolerance = ...;
    ...
    // Solve Runge-Kutta trajectory until convergence
    int converged = 0;
    while ( !converged )
    {
        // Do a Runge-Kutta step; if the step size is too large then decrease it
        int goodStepSize = 0;
        while( !goodStepSize )
        {
            // Update x, y, vx, vy
            double error = doRungeKutta(x, y, vx, vy, stepsize);
            if( error < tolerance )
                goodStepSize = 1;
            else
                stepsize *= 0.5;
        }
        if( (abs(x-x_final) < epsilon) && (abs(y-y_final) < epsilon) )
            converged = 1;
    }
}
A short test of my code shows that, before a satisfactory Runge-Kutta step size is found, the inner while loop runs 2-4 times in 99% of all cases and >10 times in 1% of all cases.
Parallel pseudo-code:
int tpb = 64;
int bpg = (N_trajectories + tpb-1) / tpb;
RungeKuttaKernel<<<bpg, tpb>>>( ... );

__global__ void RungeKuttaKernel( ... )
{
    int idx = ...;
    // Set start coordinates
    double x = x_init[idx];
    ...
    while ( !converged )
    {
        ...
        while( !goodStepSize )
        {
            double error = doRungeKutta( ... );
            ...
        }
        ...
    }
}
I will attempt to answer the question myself, until someone comes up with a better solution.
Pitfalls with directly porting the serial code:
The two while loops will lead to significant branch divergence and performance loss. The outer loop is the "full" trajectory, while the inner loop is one Runge-Kutta step with adaptive step-size correction.
Inner loop: if we attempt a Runge-Kutta step with a too-large step size, the approximation error will be too large and we need to redo the step with a smaller step size until the error is below our tolerance. This means that threads that need very few iterations to find an appropriate step size will have to wait for threads that need more iterations.
Outer loop: this reflects how many successful Runge-Kutta steps we need before the trajectory is completed. Different trajectories will reach their target in a different number of steps. We will always have to wait for the trajectory with the most iterations before we are completely done.
Proposed parallel approach:
We notice that every iteration consists of doing one Runge-Kutta step. The branching comes from the fact that we either need to reduce the step size for the next iteration, or update the Runge-Kutta coefficients (e.g. position/velocity) if the step size was OK. I therefore propose that we replace the two while loops with one for loop. The first step of the for loop is to solve Runge-Kutta, followed by an if statement that either halves the step size or updates the positions (and checks for total convergence).
All threads now solve only one Runge-Kutta step at a time, and we trade away the low occupancy (all threads waiting for the thread that needs the most attempts to find the correct step size) for the cost of branch divergence of a single if statement. In my case, solving Runge-Kutta is expensive compared with evaluating this if statement, so this is an improvement. The issue now lies in setting an appropriate limit on the for loop and flagging the threads that need more work. This limit sets an upper bound on the longest time a finished thread has to wait for the others. Pseudo-code:
int N_trajectories = 1e6;
int trajectoryStepsPerKernel = 50;
thrust::device_vector<int> isConverged(N_trajectories, 0); // Set all trajectories to unconverged
int tpb = 64;
int bpg = (N_trajectories + tpb-1) / tpb;

// Run until all trajectories are converged
while ( vectorSum(isConverged) != N_trajectories )
{
    RungeKuttaKernel<<<bpg, tpb>>>( trajectoryStepsPerKernel, isConverged, ... );
    cudaDeviceSynchronize();
}

__global__ void RungeKuttaKernel( ... )
{
    int idx = ...;
    // Set start coordinates
    int converged = 0;
    double x = x_init[idx];
    ...
    for ( int i = 0; i < trajectoryStepsPerKernel; ++i )
    {
        double error = doRungeKutta( x_new, y_new, ... );
        if( error > tolerance )
        {
            stepsize *= 0.5;
        } else {
            converged = checkConvergence( x, x_new, y, y_new, ... );
            x = x_new;
            y = y_new;
            ...
        }
    }
    // Update start positions in case we need to continue on this trajectory
    isConverged[idx] = converged;
    x_init[idx] = x;
    y_init[idx] = y;
    ...
}
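As a side note, the vectorSum(...) call in the host loop above is just a placeholder; one possible implementation (a sketch, assuming isConverged holds 0/1 flags as in the pseudo-code) is a Thrust reduction:
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// Counts how many trajectories have converged so far.
int vectorSum(const thrust::device_vector<int>& isConverged)
{
    return thrust::reduce(isConverged.begin(), isConverged.end(), 0);
}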

Writing to a shared variable from different threads doesn't provoke a data race with OpenMP

I am currently learning OpenMP basics, so I picked a simple exercise and started solving it: I am given an implementation of a serial program that approximates the value of Pi, and I am asked to give its parallel implementation.
Serial program:
static long num_steps = 100000;
double step;

void main ()
{
    int i;
    double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i = 0; i < num_steps; i++){
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
}
It computes an approximation to the integral of 4.0/(1+x²) dx over the interval from 0 to 1.
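In other words, the loop applies the midpoint rule: pi ≈ step * Σ_{i=0}^{num_steps-1} 4/(1 + x_i²), with x_i = (i+0.5)*step.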
The tutorial uses an incremental approach, introducing a little chunk at each step, so for now I am only allowed to use the parallel construct plus some runtime functions.
For me the obvious thing to do is to divide the work and perform partial summations. Here is my solution:
#include <iostream>
#include <omp.h>

int main()
{
    const long num_steps = 100000;
    double step;
    double pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    int num_steps_perthread = num_steps/4;

    double start_time = omp_get_wtime();
    #pragma omp parallel num_threads(4)
    {
        double x, partial_sum = 0.0;
        int init = num_steps_perthread * omp_get_thread_num();
        std::cout << init << "\n";
        for (int i = init; i < init+num_steps_perthread; i++){
            x = (i+0.5)*step;
            partial_sum += 4.0/(1.0+x*x);
        }
        sum += partial_sum; // this line is a data race
    }
    pi = step * sum;
    double time = omp_get_wtime() - start_time;
    std::cout << pi << " computed in " << time;
    return 0;
}
Before asking my questions, here are my assumptions about the OpenMP parallel construct (correct me if I am wrong):
any variable declared in the scope of the construct is a thread-local variable.
any variable declared outside the scope is shared.
we can control the sharing status of variables via some keywords (for now I am not aware of the exact syntax).
When I run the program I get the expected output, but in my opinion I shouldn't, because I have written a data race (it's indicated in the code): the variable sum is written to by multiple threads.
I think one possible scenario is, for example, that thread 2 writes to sum and updates its value, but before the processor updates the entire memory hierarchy (cache levels and RAM), another thread (say thread 4) picks up the old value and adds its partial sum to it. So we wouldn't have an addition but an overwrite:
1) sum = 0
2) thread 2 adds its partial_sum, let's say +2. Now sum = 2, but other memory locations still hold the old value.
3) thread 4 picks up the old value and adds to it.
4) all memory locations of sum are updated with thread 2's result.
5) the update with thread 4's result overwrites that value.
Questions:
So is my mental model correct?
Or does OpenMP add implicit synchronization?
Are my assumptions true?
Is there any implicit addition done by OpenMP?
Note: I am aware of work-sharing constructs in OpenMP; it's just that the tutorial's incremental method imposed this kind of solution.
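For reference, a minimal sketch of how the flagged accumulation is usually made safe while keeping the question's manual work split (reusing the variables from the code above): protect the single shared update with an atomic directive.
#pragma omp parallel num_threads(4)
{
    double x, partial_sum = 0.0;
    int init = num_steps_perthread * omp_get_thread_num();
    for (int i = init; i < init + num_steps_perthread; i++){
        x = (i + 0.5) * step;
        partial_sum += 4.0/(1.0 + x*x);
    }
    #pragma omp atomic
    sum += partial_sum; // the shared update is now performed atomically
}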

Parallelism vs Threading - Performance

I have been reading on the subject, but I haven't been able to find a concrete answer to my question. I am interested in using parallelism/multithreading to improve the performance of my game, but I have heard some contradictory claims, for example that multithreading may not produce any improvement in execution speed for a game.
I have thought of two ways to do this:
1) Putting the rendering component into its own thread. There are some things I would need to change, but I have a good idea of what needs to be done.
2) Using OpenMP to parallelize the rendering function. I already have code to do so, so this might be the easier option.
This being a university assessment, the target hardware is my university's computers, which are multi-core (4 cores), so I am hoping to achieve some additional efficiency using either one of those techniques.
My question is therefore the following: which one should I prefer? Which normally produces the better results?
EDIT: The main function I mean to parallelize/multithread away:
void Visualization::ClipTransBlit ( int id, Vector2i spritePosition, FrameData frame, View *view )
{
    const Rectangle viewRect = view->GetRect ();
    BYTE *bufferPtr = view->GetBuffer ();
    Texture *txt = txtMan_.GetTexture ( id );
    Rectangle clippingRect = Rectangle ( 0, frame.frameSize.x, 0, frame.frameSize.y );
    clippingRect.Translate ( spritePosition );
    clippingRect.ClipTo ( viewRect );
    Vector2i negPos ( -spritePosition.x, -spritePosition.y );
    clippingRect.Translate ( negPos );

    if ( spritePosition.x < viewRect.left_ ) { spritePosition.x = viewRect.left_; }
    if ( spritePosition.y < viewRect.top_ ) { spritePosition.y = viewRect.top_; }
    if ( clippingRect.GetArea() == 0 ) { return; }
    //clippingRect.Translate ( frameData );

    BYTE *destPtr = bufferPtr + ((abs(spritePosition.x) - abs(viewRect.left_)) + (abs(spritePosition.y) - abs(viewRect.top_)) * viewRect.Width()) * 4; // corner position of the sprite (top left corner)
    BYTE *tempSPtr = txt->GetData() + (clippingRect.left_ + clippingRect.top_ * txt->GetSize().x) * 4;

    int w = clippingRect.Width();
    int h = clippingRect.Height();
    int endOfLine = (viewRect.Width() - w) * 4;
    int endOfSourceLine = (txt->GetSize().x - w) * 4;

    for (int i = 0; i < h; i++)
    {
        for (int j = 0; j < w; j++)
        {
            if (tempSPtr[3] != 0)
            {
                memcpy(destPtr, tempSPtr, 4);
            }
            destPtr += 4;
            tempSPtr += 4;
        }
        destPtr += endOfLine;
        tempSPtr += endOfSourceLine;
    }
}
Instead of calling memcpy for each pixel, consider just setting the value directly; the overhead of calling a function that many times could dominate the overall execution time of this loop. E.g.:
for (int i = 0; i < h; i++)
{
    for (int j = 0; j < w; j++)
    {
        if (tempSPtr[3] != 0)
        {
            *((DWORD*)destPtr) = *((DWORD*)tempSPtr);
        }
        destPtr += 4;
        tempSPtr += 4;
    }
    destPtr += endOfLine;
    tempSPtr += endOfSourceLine;
}
You could also avoid the conditional by employing one of the usual tricks for avoiding conditionals; in such a tight loop, conditionals can be very expensive.
Edit:
As to whether it's better to run several instances of ClipTransBlit concurrently or to parallelize ClipTransBlit internally, I would say that, generally speaking, it's better to implement parallelization at as high a level as possible to reduce the overhead you incur by setting it up (creating threads, synchronizing them, etc.).
In your case, though, because it looks like you're drawing sprites, if they were to overlap then without additional synchronization your high-level threading might lead to nasty visual artifacts and even a race condition on checking the alpha bit. In that case the low-level parallelism might be the better choice.
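A minimal sketch (not from the original answer) of that low-level option, parallelizing ClipTransBlit internally over rows with OpenMP: each thread derives its own row pointers from the row index instead of relying on the running destPtr/tempSPtr increments, so iterations are independent.
const int destStride = w * 4 + endOfLine;       // == viewRect.Width() * 4
const int srcStride  = w * 4 + endOfSourceLine; // == txt->GetSize().x * 4
#pragma omp parallel for
for (int i = 0; i < h; i++)
{
    BYTE *dst = destPtr + i * destStride;
    const BYTE *src = tempSPtr + i * srcStride;
    for (int j = 0; j < w; j++)
    {
        if (src[3] != 0)
            *((DWORD*)dst) = *((const DWORD*)src);
        dst += 4;
        src += 4;
    }
}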
Theoretically, they should produce the same effect. In practice, it might be quite different.
If you print out the assembly code of an OpenMP program, you'll see that OpenMP simply turns the scope of a #pragma omp parallel ... block into a function call. It is similar to fork.
OpenMP is oriented toward parallel computing; multithreading, on the other hand, is more general.
For example, if you want to write a GUI program, multithreading is necessary (some frameworks may hide it, but multiple threads are still needed). However, you would never want to implement it with OpenMP.

GPU for loops: avoid warp divergence & implicit syncthreads

My situation: each thread in a warp operates on its own completely independent & distinct data array. All threads loop over their data array. The number of loop iterations is different for each thread. (This incurs a cost, I know).
Within the for loop, each thread needs to save the maximum value after calculating three floats. After the for-loop, threads in warp will "communicate" by checking the maximum value calculated by only their "neighboring thread" in the warp (determined by parity).
Questions:
If I avoid the conditionals in a "max" operation by doing multiplication, this will avoid warp divergence, right? (see example code below)
The extra multiplication operations mentioned in (1.) are worth it, right? - i.e. far faster than any sort of warp divergence.
The same mechanism that causes warp divergence (one set of instructions for all threads) can be exploited as an implicit "thread barrier" (for the warp) at the end of the for-loop (much the same way as with an "#pragma omp for" statement in non-GPU computing). Thus I don't need to make a "syncthreads" call for the warp after the for loop before one thread checks the value saved by another thread, right? (This would be because "syncthreads" is only for the "entire GPU", i.e. inter-warp and inter-MP, right?)
example code:
__shared__ int N_per_data; // loaded from host
__shared__ float ** data;  // loaded from host
data = new float*[num_threads_in_warp];
for (int j = 0; j < num_threads_in_warp; ++j)
    data[j] = new float[N_per_data[j]];
// the values of the jagged matrix "data" are loaded from host.

__shared__ float **max_data = new float*[num_threads_in_warp];
for (int j = 0; j < num_threads_in_warp; ++j)
    max_data[j] = new float[N_per_data[j]];

for (uint j = 0; j < N_per_data[threadIdx.x]; ++j)
{
    const float a = f(data[threadIdx.x][j]);
    const float b = g(data[threadIdx.x][j]);
    const float c = h(data[threadIdx.x][j]);
    const int cond_a = (a > b) && (a > c);
    const int cond_b = (b > a) && (b > c);
    const int cond_c = (c > a) && (c > b);
    // avoid if-statements. question (1) and (2)
    max_data[threadIdx.x][j] = cond_a * a + cond_b * b + cond_c * c;
}

// Question (3):
// No "syncthreads" necessary in next line:
// access data of your mate at some magic positions (assume it exists):
float my_neighbors_max_at_7 = max_data[threadIdx.x + pow(-1,(threadIdx.x % 2) == 1) ][7];
Before implementing my algorithm on a GPU, I am investigating every aspect of the algorithm to ensure that it will be worth the implementation effort. So please bear with me..
Yes
My guess would be NO - depends on how you would write the other version with the ifs.
The compiler will probably use predicates to mask out the unwanted writes, in which case there would be no real thread divergence, just a few executed but masked out write instructions.
You should let the compiler do its magic and compare the decompiled code of both versions to determine which is the better solution.
In your particular case of calculating a maximum: for signed integers, d = a > b ? a : b translates to a single PTX ISA instruction (max.s32), so there is really no need to make it as complicated as you did... just compute the maximum into a temporary variable and do one unconditional write (a short sketch follows below).
Yes, but the syncthreads barrier is an intra-block barrier, not an inter-block one, and certainly not inter-MP.
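Since the question actually computes the maximum of three floats, a minimal sketch of that suggestion could look like the following (reusing the loop and names from the question's example; fmaxf compiles down to max/predicated instructions, so there is no real divergence and only one unconditional write per iteration):
for (uint j = 0; j < N_per_data[threadIdx.x]; ++j)
{
    const float a = f(data[threadIdx.x][j]);
    const float b = g(data[threadIdx.x][j]);
    const float c = h(data[threadIdx.x][j]);
    // Maximum into a temporary, then a single unconditional write.
    const float m = fmaxf(a, fmaxf(b, c));
    max_data[threadIdx.x][j] = m;
}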

The correct usage of nested #pragma omp for directives

The following code ran like a charm before OpenMP parallelization was applied. After applying it, the code ends up in an endless loop! I'm sure that results from my incorrect use of the OpenMP directives. Would you please show me the correct way? Thank you very much.
#pragma omp parallel for
for (int nY = nYTop; nY <= nYBottom; nY++)
{
    for (int nX = nXLeft; nX <= nXRight; nX++)
    {
        // Use look-up table for performance
        dLon = theApp.m_LonLatLUT.LonGrid()[nY][nX] + m_FavoriteSVISSRParams.m_dNadirLon;
        dLat = theApp.m_LonLatLUT.LatGrid()[nY][nX];
        // If you don't want to use longitude/latitude look-up table, uncomment the following line
        //NOMGeoLocate.XYToGEO(dLon, dLat, nX, nY);

        if (dLon > 180 || dLat > 180)
        {
            continue;
        }
        if (Navigation.GeoToXY(dX, dY, dLon, dLat, 0) > 0)
        {
            continue;
        }

        // Skip void data scanline
        dY = dY - nScanlineOffset;

        // Compute coefficients as well as its four neighboring points' values
        nX1 = int(dX);
        nX2 = nX1 + 1;
        nY1 = int(dY);
        nY2 = nY1 + 1;
        dCx = dX - nX1;
        dCy = dY - nY1;
        dP1 = pIRChannelData->operator [](nY1)[nX1];
        dP2 = pIRChannelData->operator [](nY1)[nX2];
        dP3 = pIRChannelData->operator [](nY2)[nX1];
        dP4 = pIRChannelData->operator [](nY2)[nX2];

        // Bilinear interpolation
        usNomDataBlock[nY][nX] = (unsigned short)BilinearInterpolation(dCx, dCy, dP1, dP2, dP3, dP4);
    }
}
Don't nest it too deep. Usually, it would be enough to identify a good point for parallelization and get away with just one directive.
Some comments and probably the root of your problem:
#pragma omp parallel default(shared)   // Here you open several threads ...
{
    #pragma omp for
    for (int nY = nYTop; nY <= nYBottom; nY++)
    {
        #pragma omp parallel shared(nY, nYBottom)   // Same here ...
        {
            #pragma omp for
            for (int nX = nXLeft; nX <= nXRight; nX++)
            {
(Conceptually) you are opening many threads, and in each of them you open many threads again in the for loop. For each thread in the for loop, you open many threads again, and for each of those, you open many again in another for loop.
That's (thread (thread)*)+ in pattern-matching terms; there should just be thread+
Just do a single parallel for. Don't be too fine-grained; parallelize the outer loop so that each thread runs as long as possible:
#pragma omp parallel for
for (int nY = nYTop; nY <= nYBottom; nY++)
{
    for (int nX = nXLeft; nX <= nXRight; nX++)
    {
    }
}
Avoid data and cache sharing between the threads (another reason why the threads shouldn't be too fine grained on your data).
If that runs stably and shows a good speed-up, you can fine-tune it with different scheduling algorithms as per your OpenMP reference card.
And put your variable declarations where you really need them. Do not overwrite what is read by sibling threads.
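A minimal sketch of those last two points, reusing the variable names from the question (the schedule clause is only an example to tune later):
#pragma omp parallel for schedule(static)
for (int nY = nYTop; nY <= nYBottom; nY++)
{
    for (int nX = nXLeft; nX <= nXRight; nX++)
    {
        // Declared inside the loop, so every thread works on its own copies.
        double dLon = theApp.m_LonLatLUT.LonGrid()[nY][nX] + m_FavoriteSVISSRParams.m_dNadirLon;
        double dLat = theApp.m_LonLatLUT.LatGrid()[nY][nX];
        // ... rest of the loop body as in the question ...
    }
}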
You can also collapse several loops effectively. There are restrictions on the loops' conditions: they must be independent. Moreover, not all compilers support the 'collapse' clause. (As for gcc with OpenMP, it works.)
int i, j, k;
#pragma omp parallel for collapse(3)
for (i = 0; i <= N-1; i++)
    for (j = 0; j <= N-1; j++)
        for (k = 0; k <= N-1; k++)
        {
            // something useful...
        }
In practice, it is usually most beneficial to parallelize the outermost loop only. Parallelizing all the inner loops may give you too many threads (though OpenMP sticks to the number of hardware execution units when not told otherwise). And more importantly, parallelizing an inner loop will most likely create and destroy threads too often, and that's an expensive operation. Your CPU would be executing threading API calls instead of your workload.
Not really an answer, but I figured I'd share the experience.
There are issues with write safety on all the variables assigned to in the inner loop. Every thread will try to assign values to the same variables, so most likely you will get junk. For example, two threads may update dLon at the same time, resulting in thread 1 passing thread 2's value into Navigation.GeoToXY(dX, dY, dLon, dLat, 0). Since you call other methods in the loop, those methods invoked with junk arguments may not terminate.
To resolve this, either declare the variables locally inside the loop right after omp parallel for is applied, or use data-sharing clauses like firstprivate to have OpenMP automatically create local variables for each thread. In the case of firstprivate, it copies the initialized global value. For example:
int dLon = 0;
#pragma omp parallel for firstprivate(dLon) // dLon = 0 for each thread
for (...)
{
    // each thread has its own dLon variable, so there's no clash in writing
    dLon = ...;
}
See more about the clauses here: https://computing.llnl.gov/tutorials/openMP/