I'm trying to use OpenMP in my Android NDK project. I have a program where I receive frames from the camera and carry out processing on 2D matrices. I would like to parallelize the accessing and updating of each pixel value. I have the following code, where mat is the 2D matrix local to a processFrame function; however, this causes a segmentation fault:
unsigned y,x;
#pragma omp parallel shared(mat) private(y,x)
{
#pragma omp for schedule(dynamic,1) collapse(2)
//For each row
for (y = 0U; y < _height; ++y){
// For each column
for ( x = 0U; x < _width; ++x){
mat.at<Vec3b>(y, x) = Vec3b(0, 0, 0); // Background
}
}
}
At the start of the program, I am able to initialize the global matrix using OpenMP, as such:
#pragma omp parallel for schedule(dynamic,1) collapse(2)
for (unsigned y = 0U; y < height; ++y)
{
// For each column
for (unsigned x = 0U; x < width; ++x)
{
// For each channel
#pragma omp parallel for
for (unsigned i = 0U; i < 3U; ++i)
pixels[x][y].bgr[i] = pixels[x][y].mean_bgr[i] = 0.0F;
pixels[x][y].scalar = pixels[x][y].pot = pixels[x][y].mean_pot = pixels[x][y].std_pot = 0.0F;
}
}
The above code seems to work fine. Note: I make a call to omp_set_nested(1); in my main method. I'm not sure what I'm doing wrong?
UPDATE:
I think the issue lies with the Java Native Interface. Since my processFrame() function is a JNI function call, the JVM limits all processing within it to the main thread only (I think). I tried wrapping the processing, i.e. having a native function called by the JNI function, which uses a clone of the Mat image frame variable, but this also fails. I think it may not be possible to use OpenMP in this context. So my new strategy is to multi-thread from the Java side: split the Mat image frame into blocks, and process each in a separate thread. I'm not sure if there is an easy-to-use OpenMP equivalent library in Java, but I'll look into Java concurrency. Thanks for your input!
Related
The code works without parallelism, but when I add pragma omp parallel, it doesn't work. Furthermore, the code works perfectly with pragma omp parallel if I don't add setPixel. So, I would like to know why the parallelism doesn't work properly and the program exits with code 255 when I try to set a pixel in the new image. This code changes an image using two loops that update every pixel with a Gauss vector. If something can't be understood, I'll clarify it immediately.
for (h = 0; h < height; h++){
QRgb* row = (QRgb*) result->scanLine(h);
//#pragma omp parallel for schedule(dynamic) num_threads(cores) private (j, auxazul, auxrojo, auxverde) reduction(+:red,green,blue)
for (w = 0; w < width; w++) {
red=green=blue=0;
minj = max((M-w),0);
supj = min((width+M-w),N);
for (j=minj; j<supj; j++){
auxazul = azul [w-M+j][h];
auxrojo = rojo [w-M+j][h];
auxverde = verde [w-M+j][h];
red += vectorGauss[j]*auxrojo;
green += vectorGauss[j]*auxverde;
blue += vectorGauss[j]*auxazul;
}
red /= 256; green /= 256; blue /= 256;
//result->setPixel(w,h,QColor(red,green,blue).rgba());
row[w] = QColor(red,green,blue).rgba();
}
}
QImage::setPixel is not thread safe, since it calls the detach() method (have a look at the official documentation here). Remember QImage uses implicit sharing.
Besides, setPixel() is extremely slow. If you are seeking performance (as one usually does when dealing with parallel implementations), that's not the best way to go.
Using scanLine() as you already do in the example provided is the correct way of doing it.
Besides the comment that setPixel is slow and not thread safe, you currently have a race condition when writing the result:
row[w] = QColor(red,green,blue).rgba();
Your code is slow in the first place because you are accessing your color matrices in a memory-inefficient way. Adding threads will make this part worse. Given that you loop over each scanline, you would like to have the transpose of your color matrices, which allows you to do:
for (h = 0; h < height; h++){
QRgb* row = (QRgb*) result->scanLine(h);
auto azulscan = azul [h];
auto rojoscan = rojo [h];
auto verdescan = verde [h];
for (w = 0; w < width; w++) {
red=green=blue=0;
minj = max((M-w),0);
supj = min((width+M-w),N);
for (j=minj; j<supj; j++){
auto auxazul = azulscan [w-M+j];
auto auxrojo = rojoscan [w-M+j];
auto auxverde = verdescan [w-M+j];
red += vectorGauss[j]*auxrojo;
green += vectorGauss[j]*auxverde;
blue += vectorGauss[j]*auxazul;
}
row[w] = QColor(red,green,blue).rgba();
}
}
I don't know OpenMP well, but you want a single thread per scanline, so your parallel loop needs to be above the first loop. Something like
#pragma omp parallel for whatever
for (h = 0; h < height; h++){
QRgb* row;
#pragma omp critical
{
row = (QRgb*) result->scanLine(h);
}
....
}
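A minimal, self-contained sketch of the one-thread-per-scanline idea follows. It assumes a flat row-major buffer as a stand-in for QImage, and convolveRows is a hypothetical name, not anything from Qt. All accumulators are declared inside the loop body, so each thread gets its own copy, and each thread writes only to its own output row.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: horizontal convolution of a row-major image with an
// odd-length kernel. Samples that fall outside the row are skipped, using the
// same minj/supj bounds as the loops above.
std::vector<int> convolveRows(const std::vector<int>& src, int width, int height,
                              const std::vector<int>& kernel)
{
    assert(static_cast<int>(src.size()) == width * height);
    assert(kernel.size() % 2 == 1);
    const int N = static_cast<int>(kernel.size());
    const int M = N / 2;
    std::vector<int> dst(src.size());

    // One row per iteration; rows are independent, so no locking is needed.
    #pragma omp parallel for schedule(static)
    for (int h = 0; h < height; ++h) {
        const int* in = src.data() + static_cast<std::ptrdiff_t>(h) * width;
        int* out = dst.data() + static_cast<std::ptrdiff_t>(h) * width;
        for (int w = 0; w < width; ++w) {
            const int minj = std::max(M - w, 0);
            const int supj = std::min(width + M - w, N);
            int sum = 0;                      // private to the current iteration
            for (int j = minj; j < supj; ++j)
                sum += kernel[j] * in[w - M + j];
            out[w] = sum;
        }
    }
    return dst;
}
```

Compiled without OpenMP support, the pragma is ignored and the function simply runs sequentially with the same result.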
Another point. You can use std::inner_product to compute the color value in a single line once you have the transpose of the color inputs.
green = std::inner_product(&vectorGauss[minj], &vectorGauss[supj-1]+1, &verdescan[w-M+minj], decltype(green)(0)); // the fourth argument is the initial value of the sum
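For reference, std::inner_product takes the first range as a begin/end pair, only the start of the second range, and then the initial value of the sum. A small sketch under assumed names (weightedSum and its parameters are hypothetical; offset plays the role of w-M):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: sum of weights[j] * samples[offset + j] for j in [minj, supj).
int weightedSum(const std::vector<int>& weights, const std::vector<int>& samples,
                int minj, int supj, int offset)
{
    assert(0 <= minj && minj <= supj && supj <= static_cast<int>(weights.size()));
    assert(offset + minj >= 0 && offset + supj <= static_cast<int>(samples.size()));
    // Arguments: [first1, last1), first2, initial value of the accumulator.
    return std::inner_product(weights.begin() + minj, weights.begin() + supj,
                              samples.begin() + (offset + minj), 0);
}
```

The type of the initial value is also the type the sum is accumulated in, so pass 0.0 or 0.0f rather than 0 when the weights are floating point.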
I'm using Intel VTune to analyze my parallel application.
As you can see, there is a huge Spin Time at the beginning of the application (represented as the orange section on the left side):
It's more than 28% of the application's duration (which is roughly 0.14 seconds)!
As you can see, these functions are _clone, start_thread, _kmp_launch_thread and _kmp_fork_barrier. They look like OpenMP internals or system calls, but it's not specified where these functions are called from.
In addition, if we zoom at the beginning of this section, we can notice a region instantiation, represented by the selected region:
However, I never call initInterTab2d, and I have no idea whether it's called by one of the libraries that I'm using (especially OpenCV).
Digging deeper and running an Advanced Hotspot analysis, I found out a little more about the first unknown functions:
And expanding the Function/Call Stack tab:
But again, I can't really understand why these functions are called, why they take so long, and why only the master thread works during them while the others are in a "barrier" state.
If you're interested, this is the link to part of the code.
Notice that I have only one #pragma omp parallel region, which is the selected section of this image (on the right side):
The code structure is the following:
Compute some serial, non parallelizable stuff. In particular, compute a chain of blurs, which is represented by gaussianBlur (included at the end of the code). cv::GaussianBlur is an OpenCV function which exploits IPP.
Start the parallel region, where three parallel for loops are used.
The first one calls hessianResponse.
A single thread adds the results to a shared vector.
The second parallel region localfindAffineShapeArgs generates the data used by the next parallel region. The two regions can't be merged because of load imbalance.
The third region generates the final result in a balanced way.
Note: according to the lock analysis of VTune, the critical and barrier sections are not the cause of the spinning.
This is the main function of the code:
void HessianDetector::detectPyramidKeypoints(const Mat &image, cv::Mat &descriptors, const AffineShapeParams ap, const SIFTDescriptorParams sp)
{
float curSigma = 0.5f;
float pixelDistance = 1.0f;
cv::Mat octaveLayer;
// prepare first octave input image
if (par.initialSigma > curSigma)
{
float sigma = sqrt(par.initialSigma * par.initialSigma - curSigma * curSigma);
octaveLayer = gaussianBlur(image, sigma);
}
// while there is sufficient size of image
int minSize = 2 * par.border + 2;
int rowsCounter = image.rows;
int colsCounter = image.cols;
float sigmaStep = pow(2.0f, 1.0f / (float) par.numberOfScales);
int levels = 0;
while (rowsCounter > minSize && colsCounter > minSize){
rowsCounter/=2; colsCounter/=2;
levels++;
}
int scaleCycles = par.numberOfScales+2;
//-------------------Shared Vectors-------------------
std::vector<Mat> blurs (scaleCycles*levels+1, Mat());
std::vector<Mat> hessResps (levels*scaleCycles+2); //+2 because high needs an extra one
std::vector<Wrapper> localWrappers;
std::vector<FindAffineShapeArgs> findAffineShapeArgs;
localWrappers.reserve(levels*(scaleCycles-2));
vector<float> pixelDistances;
pixelDistances.reserve(levels);
for(int i=0; i<levels; i++){
pixelDistances.push_back(pixelDistance);
pixelDistance*=2;
}
//compute blurs at all layers (not parallelizable)
for(int i=0; i<levels; i++){
blurs[i*scaleCycles+1] = octaveLayer.clone();
for (int j = 1; j < scaleCycles; j++){
float sigma = par.sigmas[j]* sqrt(sigmaStep * sigmaStep - 1.0f);
blurs[j+1+i*scaleCycles] = gaussianBlur(blurs[j+i*scaleCycles], sigma);
if(j == par.numberOfScales)
octaveLayer = halfImage(blurs[j+1+i*scaleCycles]);
}
}
#pragma omp parallel
{
//compute all the hessianResponses
#pragma omp for collapse(2) schedule(dynamic)
for(int i=0; i<levels; i++)
for (int j = 1; j <= scaleCycles; j++)
{
int scaleCyclesLevel = scaleCycles * i;
float curSigma = par.sigmas[j];
hessResps[j+scaleCyclesLevel] = hessianResponse(blurs[j+scaleCyclesLevel], curSigma*curSigma);
}
//we need to allocate here localWrappers to keep alive the reference for FindAffineShapeArgs
#pragma omp single
{
for(int i=0; i<levels; i++)
for (int j = 2; j < scaleCycles; j++){
int scaleCyclesLevel = scaleCycles * i;
localWrappers.push_back(Wrapper(sp, ap, hessResps[j+scaleCyclesLevel-1], hessResps[j+scaleCyclesLevel], hessResps[j+scaleCyclesLevel+1],
blurs[j+scaleCyclesLevel-1], blurs[j+scaleCyclesLevel]));
}
}
std::vector<FindAffineShapeArgs> localfindAffineShapeArgs;
#pragma omp for collapse(2) schedule(dynamic) nowait
for(int i=0; i<levels; i++)
for (int j = 2; j < scaleCycles; j++){
size_t c = (scaleCycles-2) * i +j-2;
//toDo: octaveMap is shared, need synchronization
//if(j==1)
// octaveMap = Mat::zeros(blurs[scaleCyclesLevel+1].rows, blurs[scaleCyclesLevel+1].cols, CV_8UC1);
float curSigma = par.sigmas[j];
// find keypoints in this part of octave for curLevel
findLevelKeypoints(curSigma, pixelDistances[i], localWrappers[c]);
localfindAffineShapeArgs.insert(localfindAffineShapeArgs.end(), localWrappers[c].findAffineShapeArgs.begin(), localWrappers[c].findAffineShapeArgs.end());
}
#pragma omp critical
{
findAffineShapeArgs.insert(findAffineShapeArgs.end(), localfindAffineShapeArgs.begin(), localfindAffineShapeArgs.end());
}
#pragma omp barrier
std::vector<Result> localRes;
#pragma omp for schedule(dynamic) nowait
for(int i=0; i<findAffineShapeArgs.size(); i++){
hessianKeypointCallback->onHessianKeypointDetected(findAffineShapeArgs[i], localRes);
}
#pragma omp critical
{
for(size_t i=0; i<localRes.size(); i++)
descriptors.push_back(localRes[i].descriptor);
}
}
}
Mat gaussianBlur(const Mat input, const float sigma)
{
Mat ret(input.rows, input.cols, input.type());
int size = (int)(2.0 * 3.0 * sigma + 1.0); if (size % 2 == 0) size++;
GaussianBlur(input, ret, Size(size, size), sigma, sigma, BORDER_REPLICATE);
return ret;
}
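The pattern used twice in the parallel region above (each thread fills a private vector inside an omp for with nowait, then appends it to a shared vector inside a critical section) can be isolated in a small sketch. The function name and the even-squares workload are hypothetical and only stand in for the real keypoint work.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical example: collect the squares of all even inputs.
std::vector<int> collectEvenSquares(const std::vector<int>& input)
{
    std::vector<int> merged;                  // shared between threads
    const int n = static_cast<int>(input.size());

    #pragma omp parallel
    {
        std::vector<int> local;               // private to each thread
        #pragma omp for schedule(dynamic) nowait
        for (int i = 0; i < n; ++i)
            if (input[i] % 2 == 0)
                local.push_back(input[i] * input[i]);

        // Only one thread at a time may append to the shared vector.
        #pragma omp critical
        merged.insert(merged.end(), local.begin(), local.end());
    }

    // Sanity check: every input contributes at most one result.
    assert(merged.size() <= input.size());
    // The merge order depends on thread scheduling; sort for a stable result.
    std::sort(merged.begin(), merged.end());
    return merged;
}
```

Each thread enters the critical section once, not once per element, so the lock should rarely be contended.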
If you consider a 50 ms (a fraction of the blink of an eye) one-time cost to be a huge overhead, then you should probably focus on your workflow as such. Try to use one fully initialized process (with its threads and data structures) in a persistent way to increase the work done during each run.
That said, it may be possible to reduce the overhead, but in any case you will be very dependent on the runtime and initialization cost of your library, thus limiting your performance portability.
Your performance analysis may also be problematic. AFAIK VTune uses sampling, and your data indicates a 1 ms sampling interval. That means you may have just 50 samples during the critical initialization path of your application, too few for a confident analysis. VTune might also have some forms of OpenMP instrumentation that provide more accurate results at small time scales. In any case, I would take any performance measurement over just 150 ms with a grain of salt unless I knew exactly what impact and method the measurement has.
P.S. Running a simple code like:
#include <stdio.h>
#include <omp.h>
int main() {
double start = omp_get_wtime();
#pragma omp parallel
{
#pragma omp barrier
#pragma omp master
printf("%f s\n", omp_get_wtime() - start);
}
}
shows an initial thread creation overhead of between 3 ms and 200 ms on different systems / thread counts with the Intel OpenMP runtime.
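One way to pay that start-up cost at a moment of your choosing is to enter a trivial parallel region during initialization, before any timed work. The sketch below assumes that the runtime keeps its worker threads alive between parallel regions, which is implementation-dependent. warmUpThreads and timedSum are hypothetical names, and std::chrono is used so that the code also builds without OpenMP.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical helper: enter a parallel region once so worker threads are
// started up front. Returns how many threads entered (1 without OpenMP).
int warmUpThreads()
{
    int started = 0;
    #pragma omp parallel
    {
        #pragma omp atomic
        ++started;
    }
    return started;
}

// Sums the vector in parallel and reports the elapsed wall-clock time in ms.
double timedSum(const std::vector<double>& data, double& elapsedMs)
{
    const auto start = std::chrono::steady_clock::now();
    const long long n = static_cast<long long>(data.size());
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long long i = 0; i < n; ++i)
        sum += data[i];
    const auto stop = std::chrono::steady_clock::now();
    elapsedMs = std::chrono::duration<double, std::milli>(stop - start).count();
    assert(elapsedMs >= 0.0);
    return sum;
}
```

Calling warmUpThreads() once at program start and then timing timedSum with and without the warm-up gives a rough idea of how much of the first region's cost is thread creation on a given system.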
So I have some OpenMP code:
for(unsigned int it = 0; it < its; ++it)
{
#pragma omp parallel
{
/**
* Run the position integrator, reset the
* acceleration, update the acceleration, update the velocity.
*/
#pragma omp for schedule(dynamic, blockSize)
for(unsigned int i = 0; i < numBods; ++i)
{
Body* body = &bodies[i];
body->position += (body->velocity * timestep);
body->position += (0.5 * body->acceleration * timestep * timestep);
/**
* Update velocity for half-timestep, then reset the acceleration.
*/
body->velocity += (0.5f) * body->acceleration * timestep;
body->acceleration = Vector3();
}
/**
* Calculate the acceleration.
*/
#pragma omp for schedule(dynamic, blockSize)
for(unsigned int i = 0; i < numBods; ++i)
{
for(unsigned int j = 0; j < numBods; ++j)
{
if(j > i)
{
Body* body = &bodies[i];
Body* bodyJ = &bodies[j];
/**
* Calculating some of the subsections of the acceleration formula.
*/
Vector3 rij = bodyJ->position - body->position;
double sqrDistWithEps = rij.SqrMagnitude() + epsilon2;
double oneOverDistCubed = 1.0 / sqrt(sqrDistWithEps * sqrDistWithEps * sqrDistWithEps);
double scalar = oneOverDistCubed * gravConst;
body->acceleration += bodyJ->mass * scalar * rij;
bodyJ->acceleration -= body->mass * scalar * rij; //Newton's Third Law.
}
}
}
/**
* Velocity for the full timestep.
*/
#pragma omp for schedule(dynamic, blockSize)
for(unsigned int i = 0; i < numBods; ++i)
{
bodies[i].velocity += (0.5 * bodies[i].acceleration * timestep);
}
}
/**
* Don't want I/O to be parallel
*/
for(unsigned int index = 1; index < bodies.size(); ++index)
{
outFile << bodies[index] << std::endl;
}
}
This is fine, but I can't help but think that forking a team of threads on each iteration is a BAD IDEA. However, the iterations must happen sequentially; so I can't have the iterations themselves being parallel.
I was just wondering if there was a way to set this up to reuse the same team of threads on each iteration?
As far as I know, and it is the most logical approach, the thread pool is already created, and every time a thread reaches a parallel construct it requests a team of threads from the pool. Therefore, a new pool of threads is not created every time a parallel region is reached. However, if you want to reuse the same threads, why not just move the parallel construct out of the loop and handle the sequential code with the single pragma, something like this:
#pragma omp parallel
{
for(unsigned int it = 0; it < its; ++it)
{
...
...
/**
* Don't want I/O to be parallel
*/
#pragma omp single
{
for(unsigned int index = 1; index < bodies.size(); ++index)
{
outFile << bodies[index] << std::endl;
}
} // threads will wait in the internal barrier of the single
}
}
I did a quick search, and the first paragraph of this answer may depend on the OpenMP implementation that you are using. I highly advise you to read the manual for the one that you are using.
For example, from the source:
OpenMP* is strictly a fork/join threading model. In some OpenMP
implementations, threads are created at the start of a parallel region
and destroyed at the end of the parallel region. OpenMP applications
typically have several parallel regions with intervening serial
regions. Creating and destroying threads for each parallel region can
result in significant system overhead, especially if a parallel region
is inside a loop; therefore, the Intel OpenMP implementation uses
thread pools. A pool of worker threads is created at the first
parallel region. These threads exist for the duration of program
execution. More threads may be added automatically if requested by the
program. The threads are not destroyed until the last parallel region
is executed.
Nevertheless, if you put the parallel region outside the loop, you do not have to worry about the potential overhead cited in the above paragraph.
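A complete, compilable sketch of the hoisted parallel region follows. It uses a deliberately simplified one-dimensional update as a stand-in for the integrator in the question; Body1D, integrate and the log vector are all hypothetical. Every thread runs the iteration loop, the omp for splits the bodies among the team, and the single block does the serial part once per iteration.

```cpp
#include <cassert>
#include <vector>

struct Body1D { double position; double velocity; };   // simplified stand-in

// Advances all bodies `its` times and records bodies[0].position after each step.
void integrate(std::vector<Body1D>& bodies, unsigned its, double timestep,
               std::vector<double>& log)
{
    assert(!bodies.empty());
    const int n = static_cast<int>(bodies.size());

    #pragma omp parallel              // the team is created once, not per iteration
    {
        for (unsigned it = 0; it < its; ++it)      // `it` is private to each thread
        {
            #pragma omp for schedule(static)
            for (int i = 0; i < n; ++i)
                bodies[i].position += bodies[i].velocity * timestep;
            // implicit barrier: all positions are updated before the serial part

            #pragma omp single
            {
                log.push_back(bodies[0].position);  // serial work, e.g. the I/O
            }
            // implicit barrier: nobody starts the next iteration early
        }
    }
}
```

The two implicit barriers are what keep the iterations sequential, so they should not be removed with nowait here.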
The OpenMP model is often shown as a fork-join paradigm. But for performance reasons, threads are not killed at the end of the join. In some implementations, Intel OpenMP for instance, threads wait on a spinlock at the end of the join for a specific period before sleeping (see KMP_BLOCKTIME at https://software.intel.com/en-us/node/522775).
I have one problem, a big problem =.
I have two images (using GDI+) and I want to compare them pixel by pixel.
When pixelA == pixelB, the variable cont should be incremented.
Right now I am comparing two identical images, so the result should be 100%, but I get 70%.
Why? How can I resolve this?
See:
#pragma omp parallel for schedule(dynamic)
for (int x = 0; x < height; x++){
for (int y = 0; y < width; y++){
int luma01 = 0, luma02 = 0;
Gdiplus::Color pixelColorImage01;
Gdiplus::Color pixelColorImage02;
myImage01->GetPixel(x, y, &pixelColorImage01);
luma01 = pixelColorImage01.GetRed() + pixelColorImage01.GetGreen() + pixelColorImage01.GetBlue();
myImage02->GetPixel(x, y, &pixelColorImage02);
luma02 = pixelColorImage02.GetRed() + pixelColorImage02.GetGreen() + pixelColorImage02.GetBlue();
#pragma omp critical
if (luma01 == luma02){
cont++;
}
}
}
percentage of equality between images
thanks =)
Before you parallelize your solution make sure you can solve it sequentially. In this case that means comment out the #pragma and debug that first.
First,
for (int x = 0; x < height; x++){
for (int y = 0; y < width; y++){
...
myImage01->GetPixel(x, y, &pixelColorImage01);
You transposed width and height, so you'll get a wrong answer for any image that's not square.
Second, your pixel equality metric is subject to collisions. Since you add up the individual colors' luminosities then compare that sum, it will think that, for example, an all red pixel is equal to an all blue one.
Do something like this instead:
if (red1 == red2 && green1 == green2 && blue1 == blue2)
cont++;
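The collision is easy to reproduce. In this small sketch the Rgb struct and both comparison functions are hypothetical, not part of GDI+:

```cpp
#include <cassert>

struct Rgb { unsigned char r, g, b; };

// Compares the sum of the channels, as in the question.
bool sameBySum(Rgb a, Rgb b) { return a.r + a.g + a.b == b.r + b.g + b.b; }

// Compares channel by channel, as suggested above.
bool sameByChannel(Rgb a, Rgb b) { return a.r == b.r && a.g == b.g && a.b == b.b; }

// Pure red and pure blue have the same channel sum but are different colors.
void demoCollision()
{
    const Rgb red{255, 0, 0};
    const Rgb blue{0, 0, 255};
    assert(sameBySum(red, blue));        // false positive
    assert(!sameByChannel(red, blue));   // correctly reported as different
}
```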
As for your parallelization, it's technically correct but will give you terrible performance. You put a critical section around the if, which means all the workers are constantly trying to acquire that lock. In other words, you've got parallel workers but each one has to wait for all the others, so you've serialized your parallel code. To solve this problem, look up OpenMP reductions (the reduction clause).
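A minimal sketch of the counter written as a reduction, assuming the pixels have already been copied into plain arrays (the Pixel struct and countEqualPixels are hypothetical, not GDI+ API). Each thread accumulates into its own private copy of cont, and OpenMP adds the copies together at the end of the loop, so no critical section is needed.

```cpp
#include <cassert>
#include <vector>

struct Pixel { unsigned char r, g, b; };

// Hypothetical helper: number of positions at which the two images agree.
long countEqualPixels(const std::vector<Pixel>& a, const std::vector<Pixel>& b)
{
    assert(a.size() == b.size());
    const long n = static_cast<long>(a.size());
    long cont = 0;
    #pragma omp parallel for schedule(static) reduction(+:cont)
    for (long i = 0; i < n; ++i)
        cont += (a[i].r == b[i].r && a[i].g == b[i].g && a[i].b == b[i].b);
    return cont;
}
```

The percentage of equality is then 100.0 * cont / n.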
Thanks for your answers!
Adam, I swapped the matrix indices, and now I compare the individual colors!
Now I have two small problems.
When I run the code:
See the result of my code: http://postimg.org/image/8c1ophkz9/
Using #pragma omp parallel for schedule(dynamic,1024) reduction(+:cont)
Parallel time (0.763 seconds) (100% of similarity)
Sequential time (0.702 seconds) (100% of similarity)
Using #pragma omp parallel for schedule(dynamic) reduction(+:cont)
Parallel time (0.113 seconds) (66% of similarity)
Sequential time (0.703 seconds) (100% of similarity)
Image1 is equal to image2.
I still have many collisions in my code =\
I need to reduce the time of the parallel code. I don't understand why the parallel code is slower than the sequential code =\
Parallel code
#pragma omp parallel for schedule(dynamic) reduction(+:cont)
for (int x = 0; x < width; x++){
for (int y = 0; y < height; y++){
Gdiplus::Color pixelColorImage01;
Gdiplus::Color pixelColorImage02;
myImage01->GetPixel(x, y, &pixelColorImage01);
myImage02->GetPixel(x, y, &pixelColorImage02);
cont += (pixelColorImage01.GetRed() == pixelColorImage02.GetRed() && pixelColorImage01.GetGreen() == pixelColorImage02.GetGreen() && pixelColorImage01.GetBlue() == pixelColorImage02.GetBlue());
}
}
Sequential code
for (int x = 0; x < width; x++){
for (int y = 0; y < height; y++){
Gdiplus::Color pixelColorImage01;
Gdiplus::Color pixelColorImage02;
myImage01->GetPixel(x, y, &pixelColorImage01);
myImage02->GetPixel(x, y, &pixelColorImage02);
cont += (pixelColorImage01.GetRed() == pixelColorImage02.GetRed() && pixelColorImage01.GetGreen() == pixelColorImage02.GetGreen() && pixelColorImage01.GetBlue() == pixelColorImage02.GetBlue());
}
}
The following code ran like a charm before OpenMP parallelization was applied. With the directive in place, it gets stuck in an endless loop! I'm sure that results from my incorrect use of the OpenMP directives. Would you please show me the correct way? Thank you very much.
#pragma omp parallel for
for (int nY = nYTop; nY <= nYBottom; nY++)
{
for (int nX = nXLeft; nX <= nXRight; nX++)
{
// Use look-up table for performance
dLon = theApp.m_LonLatLUT.LonGrid()[nY][nX] + m_FavoriteSVISSRParams.m_dNadirLon;
dLat = theApp.m_LonLatLUT.LatGrid()[nY][nX];
// If you don't want to use longitude/latitude look-up table, uncomment the following line
//NOMGeoLocate.XYToGEO(dLon, dLat, nX, nY);
if (dLon > 180 || dLat > 180)
{
continue;
}
if (Navigation.GeoToXY(dX, dY, dLon, dLat, 0) > 0)
{
continue;
}
// Skip void data scanline
dY = dY - nScanlineOffset;
// Compute coefficients as well as its four neighboring points' values
nX1 = int(dX);
nX2 = nX1 + 1;
nY1 = int(dY);
nY2 = nY1 + 1;
dCx = dX - nX1;
dCy = dY - nY1;
dP1 = pIRChannelData->operator [](nY1)[nX1];
dP2 = pIRChannelData->operator [](nY1)[nX2];
dP3 = pIRChannelData->operator [](nY2)[nX1];
dP4 = pIRChannelData->operator [](nY2)[nX2];
// Bilinear interpolation
usNomDataBlock[nY][nX] = (unsigned short)BilinearInterpolation(dCx, dCy, dP1, dP2, dP3, dP4);
}
}
Don't nest it too deep. Usually, it would be enough to identify a good point for parallelization and get away with just one directive.
Some comments and probably the root of your problem:
#pragma omp parallel default(shared) // Here you open several threads ...
{
#pragma omp for
for (int nY = nYTop; nY <= nYBottom; nY++)
{
#pragma omp parallel shared(nY, nYBottom) // Same here ...
{
#pragma omp for
for (int nX = nXLeft; nX <= nXRight; nX++)
{
(Conceptually) you are opening many threads, in each of them you open many threads again in the for loop. For each thread in the for loop, you open many threads again, and for each of those, you open again many in another for loop.
That's (thread (thread)*)+ in pattern matching words; there should just be thread+
Just do a single parallel for. Don't be too fine-grained; parallelize on the outer loop, and each thread should run as long as possible:
#pragma omp parallel for
for (int nY = nYTop; nY <= nYBottom; nY++)
{
for (int nX = nXLeft; nX <= nXRight; nX++)
{
}
}
Avoid data and cache sharing between the threads (another reason why the threads shouldn't be too fine grained on your data).
If that's running stable and shows good speed up, you can fine tune it with different scheduling algorithms as per your OpenMP reference card.
And put your variable declarations to where you really need them. Do not overwrite what is read by sibling threads.
You can also collapse several loops effectively. There are restrictions on the loop conditions: they must be independent of each other. Moreover, not all compilers support the collapse clause. (GCC with OpenMP does.)
int i,j,k;
#pragma omp parallel for collapse(3)
for(i=0; i<=N-1; i++)
for(j=0; j<=N-1; j++)
for(k=0; k<=N-1; k++)
{
// something useful...
}
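A concrete two-level version of the same idea, as a sketch with a hypothetical function name. The loops are perfectly nested (nothing between the two for statements) and the inner bounds do not depend on the outer index, which is what allows the iteration spaces to be fused.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: fills a rows x cols row-major table with r * cols + c.
std::vector<int> fillTable(int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);
    std::vector<int> table(static_cast<std::size_t>(rows) * cols);

    // collapse(2) distributes all rows * cols iterations among the threads,
    // which helps when rows alone is too small to keep every thread busy.
    #pragma omp parallel for collapse(2) schedule(static)
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            table[static_cast<std::size_t>(r) * cols + c] = r * cols + c;

    return table;
}
```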
In practice, it is usually most beneficial to parallelize the outermost loop only. Parallelizing all the inner loops may give you too many threads (though OpenMP sticks with the number of hardware execution units unless told otherwise). More importantly, parallelizing an inner loop will most likely create and destroy threads too often, and that's an expensive operation. Your CPU will be executing threading API calls instead of your workload.
Not really an answer, but I figured I'd share the experience.
There are issues with write safety on all the variables assigned to in the inner loop. Every thread will try to assign values to the same variables, most likely you will get junk. For example, two threads may be updating dLon at the same time resulting in thread 1 passing thread 2's value into Navigation.GeoToXY(dX, dY, dLon, dLat, 0). Since you call other methods in the loop, those methods invoked on junk arguments may not terminate.
To resolve this, either declare the variables locally inside the loop body to which omp parallel for is applied, or use one of the data-sharing clauses such as private or firstprivate to get OpenMP to automatically create local variables for each thread. In the case of firstprivate, it will copy the initialized global value. For example,
double dLon = 0;
#pragma omp parallel for firstprivate(dLon) // dLon = 0 for each thread
for (...)
{
// each thread has its own dLon variable so there's no clash in writing
dLon = ...;
}
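The first option, declaring the temporaries inside the loop body, might look like the sketch below. It is a simplified stand-in for the loop in the question: bilinear is a hypothetical replacement for BilinearInterpolation, and the sampling positions are just x * step and y * step instead of a navigation lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the BilinearInterpolation function in the question.
double bilinear(double cx, double cy, double p1, double p2, double p3, double p4)
{
    const double top = p1 + cx * (p2 - p1);
    const double bottom = p3 + cx * (p4 - p3);
    return top + cy * (bottom - top);
}

// Samples a srcW x srcH row-major grid at (nX * step, nY * step).
std::vector<double> resample(const std::vector<double>& src, int srcW, int srcH,
                             int dstW, int dstH, double step)
{
    assert(srcW >= 2 && srcH >= 2);
    assert(static_cast<int>(src.size()) == srcW * srcH);
    std::vector<double> dst(static_cast<std::size_t>(dstW) * dstH, 0.0);

    #pragma omp parallel for
    for (int nY = 0; nY < dstH; ++nY) {
        for (int nX = 0; nX < dstW; ++nX) {
            // Every temporary is declared here, so each thread has its own copy.
            const double dX = nX * step;
            const double dY = nY * step;
            const int nX1 = static_cast<int>(dX);
            const int nY1 = static_cast<int>(dY);
            if (nX1 + 1 >= srcW || nY1 + 1 >= srcH)
                continue;                 // neighbours fall outside the source
            const double dP1 = src[nY1 * srcW + nX1];
            const double dP2 = src[nY1 * srcW + nX1 + 1];
            const double dP3 = src[(nY1 + 1) * srcW + nX1];
            const double dP4 = src[(nY1 + 1) * srcW + nX1 + 1];
            dst[nY * dstW + nX] = bilinear(dX - nX1, dY - nY1, dP1, dP2, dP3, dP4);
        }
    }
    return dst;
}
```

Because nothing declared outside the parallel loop is written except the thread's own elements of dst, no private or firstprivate clause is needed.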
See more about the clauses here: https://computing.llnl.gov/tutorials/openMP/