_kmp huge overhead and spin time for unknown calls in OpenMP? - c++

I'm using Intel VTune to analyze my parallel application.
As you can see, there is a huge Spin Time at the beginning of the application (represented by the orange section on the left side):
It accounts for more than 28% of the application's duration (which is roughly 0.14 seconds)!
As you can see, these functions are _clone, start_thread, _kmp_launch_thread and _kmp_fork_barrier. They look like OpenMP internals or system calls, but it's not specified where these functions are called from.
In addition, if we zoom in on the beginning of this section, we can notice a region instantiation, represented by the selected region:
However, I never call initInterTab2d and I have no idea whether it's called by one of the libraries that I'm using (especially OpenCV).
Digging deeper and running an Advanced Hotspots analysis, I found out a little more about the first unknown functions:
And expanding the Function/Call Stack tab:
But again, I can't really understand why these functions are called, why they take so long, and why only the master thread works during them while the others are in a "barrier" state.
If you're interested, this is the link to part of the code.
Notice that I have only one #pragma omp parallel region, which is the selected section of this image (on the right side):
The code structure is the following:
Compute some serial, non parallelizable stuff. In particular, compute a chain of blurs, which is represented by gaussianBlur (included at the end of the code). cv::GaussianBlur is an OpenCV function which exploits IPP.
Start the parallel region, where three parallel for loops are used.
The first one calls hessianResponse.
A single thread adds the results to a shared vector.
The second parallel region (localfindAffineShapeArgs) generates the data used by the next parallel region. The two regions can't be merged because of load imbalance.
The third region generates the final result in a balanced way.
Note: according to the lock analysis of VTune, the critical and barrier sections are not the cause of the spinning.
This is the main function of the code:
void HessianDetector::detectPyramidKeypoints(const Mat &image, cv::Mat &descriptors, const AffineShapeParams ap, const SIFTDescriptorParams sp)
{
float curSigma = 0.5f;
float pixelDistance = 1.0f;
cv::Mat octaveLayer;
// prepare first octave input image
if (par.initialSigma > curSigma)
{
float sigma = sqrt(par.initialSigma * par.initialSigma - curSigma * curSigma);
octaveLayer = gaussianBlur(image, sigma);
}
// while there is sufficient size of image
int minSize = 2 * par.border + 2;
int rowsCounter = image.rows;
int colsCounter = image.cols;
float sigmaStep = pow(2.0f, 1.0f / (float) par.numberOfScales);
int levels = 0;
while (rowsCounter > minSize && colsCounter > minSize){
rowsCounter/=2; colsCounter/=2;
levels++;
}
int scaleCycles = par.numberOfScales+2;
//-------------------Shared Vectors-------------------
std::vector<Mat> blurs (scaleCycles*levels+1, Mat());
std::vector<Mat> hessResps (levels*scaleCycles+2); //+2 because high needs an extra one
std::vector<Wrapper> localWrappers;
std::vector<FindAffineShapeArgs> findAffineShapeArgs;
localWrappers.reserve(levels*(scaleCycles-2));
vector<float> pixelDistances;
pixelDistances.reserve(levels);
for(int i=0; i<levels; i++){
pixelDistances.push_back(pixelDistance);
pixelDistance*=2;
}
//compute blurs at all layers (not parallelizable)
for(int i=0; i<levels; i++){
blurs[i*scaleCycles+1] = octaveLayer.clone();
for (int j = 1; j < scaleCycles; j++){
float sigma = par.sigmas[j]* sqrt(sigmaStep * sigmaStep - 1.0f);
blurs[j+1+i*scaleCycles] = gaussianBlur(blurs[j+i*scaleCycles], sigma);
if(j == par.numberOfScales)
octaveLayer = halfImage(blurs[j+1+i*scaleCycles]);
}
}
#pragma omp parallel
{
//compute all the hessianResponses
#pragma omp for collapse(2) schedule(dynamic)
for(int i=0; i<levels; i++)
for (int j = 1; j <= scaleCycles; j++)
{
int scaleCyclesLevel = scaleCycles * i;
float curSigma = par.sigmas[j];
hessResps[j+scaleCyclesLevel] = hessianResponse(blurs[j+scaleCyclesLevel], curSigma*curSigma);
}
//we need to allocate here localWrappers to keep alive the reference for FindAffineShapeArgs
#pragma omp single
{
for(int i=0; i<levels; i++)
for (int j = 2; j < scaleCycles; j++){
int scaleCyclesLevel = scaleCycles * i;
localWrappers.push_back(Wrapper(sp, ap, hessResps[j+scaleCyclesLevel-1], hessResps[j+scaleCyclesLevel], hessResps[j+scaleCyclesLevel+1],
blurs[j+scaleCyclesLevel-1], blurs[j+scaleCyclesLevel]));
}
}
std::vector<FindAffineShapeArgs> localfindAffineShapeArgs;
#pragma omp for collapse(2) schedule(dynamic) nowait
for(int i=0; i<levels; i++)
for (int j = 2; j < scaleCycles; j++){
size_t c = (scaleCycles-2) * i +j-2;
//toDo: octaveMap is shared, need synchronization
//if(j==1)
// octaveMap = Mat::zeros(blurs[scaleCyclesLevel+1].rows, blurs[scaleCyclesLevel+1].cols, CV_8UC1);
float curSigma = par.sigmas[j];
// find keypoints in this part of octave for curLevel
findLevelKeypoints(curSigma, pixelDistances[i], localWrappers[c]);
localfindAffineShapeArgs.insert(localfindAffineShapeArgs.end(), localWrappers[c].findAffineShapeArgs.begin(), localWrappers[c].findAffineShapeArgs.end());
}
#pragma omp critical
{
findAffineShapeArgs.insert(findAffineShapeArgs.end(), localfindAffineShapeArgs.begin(), localfindAffineShapeArgs.end());
}
#pragma omp barrier
std::vector<Result> localRes;
#pragma omp for schedule(dynamic) nowait
for(int i=0; i<findAffineShapeArgs.size(); i++){
hessianKeypointCallback->onHessianKeypointDetected(findAffineShapeArgs[i], localRes);
}
#pragma omp critical
{
for(size_t i=0; i<localRes.size(); i++)
descriptors.push_back(localRes[i].descriptor);
}
} // end of the parallel region
}
Mat gaussianBlur(const Mat input, const float sigma)
{
Mat ret(input.rows, input.cols, input.type());
int size = (int)(2.0 * 3.0 * sigma + 1.0); if (size % 2 == 0) size++;
GaussianBlur(input, ret, Size(size, size), sigma, sigma, BORDER_REPLICATE);
return ret;
}

If you consider a 50 ms (a fraction of the blink of an eye) one-time cost to be a huge overhead, then you should probably focus on your workflow as such. Try to use one fully initialized process (with its threads and data structures) in a persistent way to increase the work done during each run.
That said, it may be possible to reduce the overhead, but in any case you will be very dependent on the runtime and initialization cost of your library, thus limiting your performance portability.
Your performance analysis may also be problematic. AFAIK VTune uses sampling, and your data indicates a 1 ms sampling interval. That means you may have just 50 samples during the critical initialization path of your application, too few for a confident analysis. VTune might also have some form of OpenMP instrumentation that provides more accurate results at small time scales. In any case I would take any performance measurement over just 150 ms with a grain of salt unless I knew exactly what impact and method the measurement has.
P.S. Running a simple code like:
#include <stdio.h>
#include <omp.h>
int main() {
double start = omp_get_wtime();
#pragma omp parallel
{
#pragma omp barrier
#pragma omp master
printf("%f s\n", omp_get_wtime() - start);
}
}
shows an initial thread creation overhead of between 3 ms and 200 ms on different systems and thread counts with the Intel OpenMP runtime.
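If the goal is only to keep this one-time cost out of a measured section, one option is to trigger thread creation with a trivial parallel region beforehand. The following is a minimal sketch, assuming the runtime keeps its worker threads alive between regions (which is implementation-dependent); warm_up_runtime and time_empty_region_ms are hypothetical helper names, not part of any library.

```cpp
#include <chrono>

// Hypothetical helper: runs a trivial parallel region so that the OpenMP
// runtime creates its worker threads now. Returns the number of threads
// that took part (1 if the code is built without OpenMP support).
int warm_up_runtime() {
    int threads = 0;
    #pragma omp parallel reduction(+:threads)
    threads += 1;
    return threads;
}

// Hypothetical helper: wall-clock time in milliseconds of one parallel
// region that contains nothing but a barrier.
double time_empty_region_ms() {
    const auto start = std::chrono::steady_clock::now();
    #pragma omp parallel
    {
        #pragma omp barrier
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_empty_region_ms() once before and once after warm_up_runtime() gives a rough idea of how much of the first region's cost is thread creation on a given system.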

Related

Multi-threading molecular simulations in C++

I am developing a molecular dynamics simulation code in C++, which essentially takes atom positions and other properties as input and simulates their motion under Newton's laws of motion. The core algorithm uses what's called the Velocity Verlet scheme and looks like:
// iterate through time (k=[1,#steps])
double Dt = 0.002; // time step
double Ttot = 1.0; // total time
double halfDt = Dt/2.0;
for (int k = 1; k*Dt <= Ttot; k++){
for (int i = 0; i < number_particles; i++)
vHalf[i] = p[i].velocity + F[i]*halfDt; // step 1
for (int i = 0; i < number_particles; i++)
p[i].position += vHalf[i]*Dt; // step 2
for (int i = 0; i < number_particles; i++)
F[i] = Force(p,i); // recalculate force on all particle i's
for (int i = 0; i < number_particles; i++)
p[i].velocity = vHalf[i] + F[i]*halfDt; // step 3
}
Where p is an array of class objects which store things like particle position, velocity, mass, etc. and Force is a function that calculates the net force on a particle using something like Lennard-Jones potential.
My question regards the time required to complete the calculation. All of my subroutines are optimized in terms of crunching numbers (e.g. using x*x*x to raise to the third power instead of pow(x,3)), but the main issue is that the time loop will often be performed for millions of iterations and there are typically close to a million particles. Is there any way to implement this algorithm using multi-threading? From my understanding, multi-threading essentially opens another stream of data to and from a CPU core, which would allow me to run two different simulations at the same time; I would like to use multi-threading to make just one of these simulations run faster.
I'd recommend using OpenMP.
Your specific use case is trivially parallelizable.
Parallelization should be as simple as:
double Dt = 0.002; // time step
double Ttot = 1.0; // total time
double halfDt = Dt/2.0;
for (int k = 1; k*Dt <= Ttot; k++){
#pragma omp parallel for
for (int i = 0; i < number_particles; i++){
vHalf[i] = p[i].velocity + F[i]*halfDt; // step 1
p[i].position += vHalf[i]*Dt; // step 2
}
// the implicit barrier above guarantees all positions are updated before forces are recomputed
#pragma omp parallel for
for (int i = 0; i < number_particles; i++){
F[i] = Force(p,i); // recalculate force on all particle i's (assumes Force only reads positions)
p[i].velocity = vHalf[i] + F[i]*halfDt; // step 3
}
}
Most popular compilers and platforms have support for OpenMP.
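As a self-contained illustration of the same loop structure, here is a sketch of one time step in one dimension. The Particle struct, the force_on function (unit-stiffness springs between all pairs, unit mass) and verlet_step are hypothetical stand-ins for the question's p, Force and time loop body; the only property relied on is that the force reads positions and nothing else.

```cpp
#include <vector>

struct Particle {
    double position;
    double velocity;
};

// Hypothetical stand-in for Force(p, i): linear springs of unit stiffness
// between every pair of particles. It only reads positions.
double force_on(const std::vector<Particle>& p, int i) {
    double f = 0.0;
    for (int j = 0; j < static_cast<int>(p.size()); ++j)
        f += p[j].position - p[i].position;
    return f;
}

// One Velocity Verlet step for unit-mass particles.
// F must hold the forces for the current positions on entry.
void verlet_step(std::vector<Particle>& p, std::vector<double>& F,
                 std::vector<double>& vHalf, double dt) {
    const int n = static_cast<int>(p.size());
    const double halfDt = dt / 2.0;
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            vHalf[i] = p[i].velocity + F[i] * halfDt;  // step 1
            p[i].position += vHalf[i] * dt;            // step 2
        }
        // Implicit barrier: every position is updated before any force is read.
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            F[i] = force_on(p, i);                     // new forces
            p[i].velocity = vHalf[i] + F[i] * halfDt;  // step 3
        }
    }
}
```

Both loops share one parallel region here, so the thread team is entered once per step instead of twice; whether that matters in practice depends on the runtime.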

Is it possible to create a team of threads, and then only "use" the threads later?

So I have some OpenMP code:
for(unsigned int it = 0; it < its; ++it)
{
#pragma omp parallel
{
/**
* Run the position integrator, reset the
* acceleration, update the acceleration, update the velocity.
*/
#pragma omp for schedule(dynamic, blockSize)
for(unsigned int i = 0; i < numBods; ++i)
{
Body* body = &bodies[i];
body->position += (body->velocity * timestep);
body->position += (0.5 * body->acceleration * timestep * timestep);
/**
* Update velocity for half-timestep, then reset the acceleration.
*/
body->velocity += (0.5f) * body->acceleration * timestep;
body->acceleration = Vector3();
}
/**
* Calculate the acceleration.
*/
#pragma omp for schedule(dynamic, blockSize)
for(unsigned int i = 0; i < numBods; ++i)
{
for(unsigned int j = 0; j < numBods; ++j)
{
if(j > i)
{
Body* body = &bodies[i];
Body* bodyJ = &bodies[j];
/**
* Calculating some of the subsections of the acceleration formula.
*/
Vector3 rij = bodyJ->position - body->position;
double sqrDistWithEps = rij.SqrMagnitude() + epsilon2;
double oneOverDistCubed = 1.0 / sqrt(sqrDistWithEps * sqrDistWithEps * sqrDistWithEps);
double scalar = oneOverDistCubed * gravConst;
body->acceleration += bodyJ->mass * scalar * rij;
bodyJ->acceleration -= body->mass * scalar * rij; //Newton's Third Law.
}
}
}
/**
* Velocity for the full timestep.
*/
#pragma omp for schedule(dynamic, blockSize)
for(unsigned int i = 0; i < numBods; ++i)
{
bodies[i].velocity += (0.5 * bodies[i].acceleration * timestep);
}
}
/**
* Don't want I/O to be parallel
*/
for(unsigned int index = 1; index < bodies.size(); ++index)
{
outFile << bodies[index] << std::endl;
}
}
This is fine, but I can't help but think that forking a team of threads on each iteration is a BAD IDEA. However, the iterations must happen sequentially; so I can't have the iterations themselves being parallel.
I was just wondering if there was a way to set this up to reuse the same team of threads on each iteration?
As far as I know, and it is the most logical approach, the thread pool is already created, and every time a thread reaches a parallel construct it requests a team of threads from the pool. Therefore, a new pool of threads is not created every time a parallel construct is reached. However, if you want to reuse the same threads, why not just move the parallel construct out of the loop and handle the sequential code with the single pragma, something like this:
#pragma omp parallel
{
for(unsigned int it = 0; it < its; ++it)
{
...
...
/**
* Don't want I/O to be parallel
*/
#pragma omp single
{
for(unsigned int index = 1; index < bodies.size(); ++index)
{
outFile << bodies[index] << std::endl;
}
} // threads will wait in the internal barrier of the single
}
}
I did a quick search, and the first paragraph of this answer may depend on the OpenMP implementation that you are using. I highly advise you to read the manual for the one that you are using.
For example, from the source:
OpenMP* is strictly a fork/join threading model. In some OpenMP
implementations, threads are created at the start of a parallel region
and destroyed at the end of the parallel region. OpenMP applications
typically have several parallel regions with intervening serial
regions. Creating and destroying threads for each parallel region can
result in significant system overhead, especially if a parallel region
is inside a loop; therefore, the Intel OpenMP implementation uses
thread pools. A pool of worker threads is created at the first
parallel region. These threads exist for the duration of program
execution. More threads may be added automatically if requested by the
program. The threads are not destroyed until the last parallel region
is executed.
Nevertheless, if you put the parallel region outside the loop, you do not have to worry about the potential overhead cited in the above paragraph.
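To make the hoisted-region pattern concrete, here is a minimal, self-contained sketch. The function run_steps and its workload (adding 1.0 to every element, then recording the sum) are hypothetical placeholders for the integrator and the I/O in the question; the point is only the placement of the parallel, for and single constructs.

```cpp
#include <vector>

// Hypothetical example: runs `iterations` sequential steps inside ONE
// parallel region. Each step updates `values` in parallel, then a single
// thread records the sum (standing in for the serial I/O).
std::vector<double> run_steps(std::vector<double>& values, int iterations) {
    std::vector<double> totals;  // only touched inside the single block
    const int n = static_cast<int>(values.size());
    #pragma omp parallel
    {
        // Every thread executes this loop; `it` is private to each thread.
        for (int it = 0; it < iterations; ++it) {
            #pragma omp for
            for (int i = 0; i < n; ++i)
                values[i] += 1.0;
            // Implicit barrier at the end of the for: the update is complete.
            #pragma omp single
            {
                double sum = 0.0;
                for (int i = 0; i < n; ++i)
                    sum += values[i];
                totals.push_back(sum);
            }
            // Implicit barrier at the end of the single: no thread starts
            // the next iteration while the serial part is still running.
        }
    }
    return totals;
}
```

For three zero-initialized values and two iterations, the recorded sums are 3 and 6. All threads must reach the for and single constructs in the same order, which is why the iteration loop cannot contain thread-dependent early exits.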
The OpenMP model is often shown as a fork-join paradigm, but for performance reasons threads are not killed at the end of the join. In some implementations, Intel OpenMP for instance, threads wait on a spinlock at the end of the join for a specific period before sleeping (see KMP_BLOCKTIME on https://software.intel.com/en-us/node/522775).

Parallel for with omp

I am trying to optimise the following loop with OpenMP:
#pragma omp parallel for private(diff)
for (int j = 0; j < x.d; ++j) {
diff = x(example,j) - x(chosen_pts[ndx - 1],j);
#pragma omp atomic
d2 += diff * diff;
}
But it actually runs 4x slower than without the #pragma.
EDIT
As Piotr S., coincoin and erenon pointed out, in my case x.d is so small that parallelism makes my code run slower. I am posting the outer loop too, in case there is some possibility for multithreading there (x.n is over 100 million):
float sum_distribution = 0.0;
// look for the point that is furthest from any center
float max_dist = 0.0;
for (int i = 0; i < x.n; ++i) {
int example = dist2[i].second;
float d2 = 0.0, diff;
//#pragma omp parallel for private(diff) reduction(+:d2)
for (int j = 0; j < x.d; ++j) {
diff = x(example,j) - x(chosen_pts[ndx - 1],j);
d2 += diff * diff;
}
if (d2 < dist2[i].first) {
dist2[i].first = d2;
}
if (dist2[i].first > max_dist) {
max_dist = dist2[i].first;
}
sum_distribution += dist2[i].first;
}
If someone is interested, here is the whole function: https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L169, but as I measured 85% of the elapsed time comes from this loop.
Yes, the outer loop, as posted, can be parallelized with OpenMP.
All variables modified in the loop are either local to an iteration or are used for aggregation over the loop. And I assume that calls to x() in the calculation of diff have no side effects.
To do aggregation in parallel correctly and efficiently, you need to use an OpenMP loop with reduction clause. For sum_distribution the reduction operation is +, and for max_dist it's max. So, adding the following pragma in front of the outer loop should do the job:
#pragma omp parallel for reduction(+:sum_distribution) reduction(max:max_dist)
Note that max as a reduction operation can only be used since OpenMP 3.1. It's not that new, so most OpenMP-enabled compilers already support it, but not all; or you might be using an older version. So it makes sense to consult the documentation for your compiler.
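A small self-contained sketch of the two reductions used together is shown below. The Stats struct and sum_and_max function are hypothetical names, and the input is assumed to be non-negative (as squared distances are), which is why the running maximum can start at zero.

```cpp
#include <vector>

struct Stats {
    double sum;
    double largest;
};

// Hypothetical example: sums the values and finds the largest one in a
// single parallel pass. Assumes all values are non-negative.
Stats sum_and_max(const std::vector<double>& d) {
    double sum = 0.0;
    double largest = 0.0;
    const int n = static_cast<int>(d.size());
    // Each thread gets private copies of `sum` and `largest`; OpenMP
    // combines them with + and max when the loop finishes.
    #pragma omp parallel for reduction(+:sum) reduction(max:largest)
    for (int i = 0; i < n; ++i) {
        sum += d[i];
        if (d[i] > largest)
            largest = d[i];
    }
    return {sum, largest};
}
```

With floating-point data the parallel sum may differ from the serial one in the last digits, because the additions happen in a different order.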

OpenMP seg fault (Android NDK) when trying to parallelize nested loops

I'm trying to use OpenMP in my Android NDK project. I have a program where I receive frames from the camera - and carry out processing on 2D matrices. I would like to parallelize the accessing and updating of each pixel value. I have the following code, where mat is the 2D matrix local to a processFrame function; however, this causes a segmentation fault:
unsigned y,x;
#pragma omp parallel shared(mat) private(y,x)
{
#pragma omp for schedule(dynamic,1) collapse(2)
//For each row
for (y = 0U; y < _height; ++y){
// For each column
for ( x = 0U; x < _width; ++x){
mat.at<Vec3b>(y, x) = Vec3b(0, 0, 0); // Background
}
}
}
At the start of the program, I am able to initialize the global matrix using OpenMP, as such:
#pragma omp parallel for schedule(dynamic,1) collapse(2)
for (unsigned y = 0U; y < height; ++y)
{
// For each column
for (unsigned x = 0U; x < width; ++x)
{
// For each channel
#pragma omp parallel for
for (unsigned i = 0U; i < 3U; ++i)
pixels[x][y].bgr[i] = pixels[x][y].mean_bgr[i] = 0.0F;
pixels[x][y].scalar = pixels[x][y].pot = pixels[x][y].mean_pot = pixels[x][y].std_pot = 0.0F;
}
}
The above code seems to work fine. Note: I make a call to omp_set_nested(1); in my main method. I'm not sure what I'm doing wrong?
UPDATE:
I think the issue lies with the Java Native Interface. Since my processFrame() function is a JNI function call, the JVM limits all processing within it to the main thread only (I think). I tried wrapping the processing - i.e. having a native function called by the JNI function, which uses a clone of the Mat image frame variable - but this also fails. I think it may not be possible to use OpenMP in this context. So my new strategy is to multi-thread from the Java side: split the Mat image frame into blocks, and process each in a separate thread. I'm not sure if there is an easy-to-use OpenMP equivalent library in Java, but I'll look into Java concurrency. Thanks for your input!

The correct usage of nested #pragma omp for directives

The following code ran like a charm before OpenMP parallelization was applied. With the directive in place, it ends up in an endless loop! I'm sure that results from my incorrect use of the OpenMP directives. Would you please show me the correct way? Thank you very much.
#pragma omp parallel for
for (int nY = nYTop; nY <= nYBottom; nY++)
{
for (int nX = nXLeft; nX <= nXRight; nX++)
{
// Use look-up table for performance
dLon = theApp.m_LonLatLUT.LonGrid()[nY][nX] + m_FavoriteSVISSRParams.m_dNadirLon;
dLat = theApp.m_LonLatLUT.LatGrid()[nY][nX];
// If you don't want to use longitude/latitude look-up table, uncomment the following line
//NOMGeoLocate.XYToGEO(dLon, dLat, nX, nY);
if (dLon > 180 || dLat > 180)
{
continue;
}
if (Navigation.GeoToXY(dX, dY, dLon, dLat, 0) > 0)
{
continue;
}
// Skip void data scanline
dY = dY - nScanlineOffset;
// Compute coefficients as well as its four neighboring points' values
nX1 = int(dX);
nX2 = nX1 + 1;
nY1 = int(dY);
nY2 = nY1 + 1;
dCx = dX - nX1;
dCy = dY - nY1;
dP1 = pIRChannelData->operator [](nY1)[nX1];
dP2 = pIRChannelData->operator [](nY1)[nX2];
dP3 = pIRChannelData->operator [](nY2)[nX1];
dP4 = pIRChannelData->operator [](nY2)[nX2];
// Bilinear interpolation
usNomDataBlock[nY][nX] = (unsigned short)BilinearInterpolation(dCx, dCy, dP1, dP2, dP3, dP4);
}
}
Don't nest it too deep. Usually, it would be enough to identify a good point for parallelization and get away with just one directive.
Some comments and probably the root of your problem:
#pragma omp parallel default(shared) // Here you open several threads ...
{
#pragma omp for
for (int nY = nYTop; nY <= nYBottom; nY++)
{
#pragma omp parallel shared(nY, nYBottom) // Same here ...
{
#pragma omp for
for (int nX = nXLeft; nX <= nXRight; nX++)
{
(Conceptually) you are opening many threads, in each of them you open many threads again in the for loop. For each thread in the for loop, you open many threads again, and for each of those, you open again many in another for loop.
That's (thread (thread)*)+ in pattern matching words; there should just be thread+
Just do a single parallel for. Don't be too fine-grained; parallelize on the outer loop, so that each thread runs as long as possible:
#pragma omp parallel for
for (int nY = nYTop; nY <= nYBottom; nY++)
{
for (int nX = nXLeft; nX <= nXRight; nX++)
{
}
}
Avoid data and cache sharing between the threads (another reason why the threads shouldn't be too fine grained on your data).
If that's running stable and shows good speed up, you can fine tune it with different scheduling algorithms as per your OpenMP reference card.
And put your variable declarations to where you really need them. Do not overwrite what is read by sibling threads.
You can also collapse several loops effectively. There are restrictions on the loops' conditions: they must be independent. Moreover, not all compilers support the collapse clause. (gcc with OpenMP does.)
int i,j,k;
#pragma omp parallel for collapse(3)
for(i=0; i<=N-1; i++)
for(j=0; j<=N-1; j++)
for(k=0; k<=N-1; k++)
{
// something useful...
}
In practice, it is usually most beneficial to parallelize the outermost loop only. Parallelizing all the inner loops may give you too many threads (though OpenMP sticks with the number of hardware execution units unless told otherwise). More importantly, parallelizing an inner loop will most likely create and destroy threads too often, and that's an expensive operation. Your CPU will be executing threading API calls instead of your workload.
Not really an answer, but I figured I'd share the experience.
There are issues with write safety on all the variables assigned to in the inner loop. Every thread will try to assign values to the same variables, most likely you will get junk. For example, two threads may be updating dLon at the same time resulting in thread 1 passing thread 2's value into Navigation.GeoToXY(dX, dY, dLon, dLat, 0). Since you call other methods in the loop, those methods invoked on junk arguments may not terminate.
To resolve this, either declare the variables locally inside the outer loop body, right after omp parallel for is applied, or use a data-sharing clause such as private or firstprivate to get OpenMP to create a local copy for each thread automatically. In the case of firstprivate, each copy is initialized with the value the variable had before the region. For example,
int dLon = 0;
#pragma omp parallel for firstprivate(dLon) // dLon = 0 for each thread
for (...)
{
// each thread has its own dLon variable so there's no clash in writing
dLon = ...;
}
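The other option mentioned above, declaring the temporaries inside the loop body, can be sketched as follows. The function process_grid and its arithmetic are hypothetical placeholders for the look-up and interpolation in the question; what matters is that lon and lat live inside the loop, so every iteration works on its own copies.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: fills a rows x cols grid (row-major) in parallel.
std::vector<double> process_grid(int rows, int cols) {
    std::vector<double> out(static_cast<std::size_t>(rows) * cols, 0.0);
    #pragma omp parallel for
    for (int y = 0; y < rows; ++y) {
        for (int x = 0; x < cols; ++x) {
            // Declared inside the loop body: private to the executing
            // thread, so no other thread can overwrite them.
            const double lon = 0.5 * x;
            const double lat = 0.25 * y;
            // Each (y, x) pair writes to a distinct element, so the
            // writes do not conflict either.
            out[static_cast<std::size_t>(y) * cols + x] = lon + lat;
        }
    }
    return out;
}
```

This form needs no data-sharing clauses at all, which makes it harder to forget one when a new temporary is added later.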
See more about the clauses here: https://computing.llnl.gov/tutorials/openMP/