QImage setPixel with OpenMP parallel for doesn't work - c++

The code works without parallelism, but when I add #pragma omp parallel it doesn't. Furthermore, it works perfectly with #pragma omp parallel if I leave out setPixel. So I would like to know why the parallel version doesn't work properly and exits the program with code 255 when I try to set a pixel in the new image. The code transforms an image with two loops that recompute every pixel using a Gauss vector. If anything is unclear, I'll clarify it immediately.
for (h = 0; h < height; h++){
QRgb* row = (QRgb*) result->scanLine(h);
//#pragma omp parallel for schedule(dynamic) num_threads(cores) private (j, auxazul, auxrojo, auxverde) reduction(+:red,green,blue)
for (w = 0; w < width; w++) {
red=green=blue=0;
minj = max((M-w),0);
supj = min((width+M-w),N);
for (j=minj; j<supj; j++){
auxazul = azul [w-M+j][h];
auxrojo = rojo [w-M+j][h];
auxverde = verde [w-M+j][h];
red += vectorGauss[j]*auxrojo;
green += vectorGauss[j]*auxverde;
blue += vectorGauss[j]*auxazul;
}
red /= 256; green /= 256; blue /= 256;
//result->setPixel(w,h,QColor(red,green,blue).rgba());
row[w] = QColor(red,green,blue).rgba();
}
}

QImage::setPixel is not thread safe, since it calls the detach() method (have a look at the official documentation). Remember that QImage uses implicit sharing.
Besides, setPixel() is extremely slow. If you are after performance (as one usually is when writing parallel implementations), that's not the best way to go.
Using scanLine(), as you already do in the example provided, is the correct way of doing it.

Besides the comment that setPixel is slow and not thread safe, you currently have a race condition when writing the result:
row[w] = QColor(red,green,blue).rgba();
Your code is slow in the first place because you are accessing your color matrices in a memory-inefficient way. Adding threads will make this part worse. Given that you loop over each scanline, you would like to have the transpose of your color matrices, which allows you to do:
for (h = 0; h < height; h++){
QRgb* row = (QRgb*) result->scanLine(h);
auto azulscan = azul [h];
auto rojoscan = rojo [h];
auto verdescan = verde [h];
for (w = 0; w < width; w++) {
red=green=blue=0;
minj = max((M-w),0);
supj = min((width+M-w),N);
for (j=minj; j<supj; j++){
auto auxazul = azulscan [w-M+j];
auto auxrojo = rojoscan [w-M+j];
auto auxverde = verdescan [w-M+j];
red += vectorGauss[j]*auxrojo;
green += vectorGauss[j]*auxverde;
blue += vectorGauss[j]*auxazul;
}
red /= 256; green /= 256; blue /= 256; // same normalization as in the question
row[w] = QColor(red,green,blue).rgba();
}
}
I don't know OpenMP well, but you want a single thread per scanline, so the parallel loop needs to go on the outermost loop. Something like:
#pragma omp parallel for whatever
for (h = 0; h < height; h++){
QRgb* row;
#pragma omp critical
{
row = (QRgb*) result->scanLine(h);
}
....
}
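Another point: you can use std::inner_product (from <numeric>) to compute the color value in a single line once you have the transpose of the color inputs. Its arguments are the first range, the start of the second range and the initial value (use 0.0 instead of 0 if the values are floating point):
green = std::inner_product(&vectorGauss[minj], &vectorGauss[supj-1]+1, &verdescan[w-M+minj], 0);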
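A minimal sketch of how these pieces could fit together, under stated assumptions: the color plane is a plain row-major std::vector (standing in for the transposed rojo/verde/azul arrays), the helper name blurRows is hypothetical, and no Qt types are used. Each row is handled by one thread and every value written inside the loop is local to that row, so there is nothing to race on. The /256 normalization from the question is left out for brevity.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical helper (not from the question): blur one color plane along
// each row. plane[h][w] is row-major; kernel has N taps centred on index M.
std::vector<std::vector<int>> blurRows(const std::vector<std::vector<int>>& plane,
                                       const std::vector<int>& kernel, int M)
{
    const int height = static_cast<int>(plane.size());
    const int width  = height ? static_cast<int>(plane[0].size()) : 0;
    const int N      = static_cast<int>(kernel.size());
    std::vector<std::vector<int>> out(height, std::vector<int>(width, 0));

    // One thread processes whole rows; the pragma is ignored (serial run)
    // when the compiler is not invoked with OpenMP enabled.
    #pragma omp parallel for schedule(static)
    for (int h = 0; h < height; ++h) {
        const int* src = plane[h].data();
        int* dst = out[h].data();
        for (int w = 0; w < width; ++w) {
            // Same clipping as in the question: keep w-M+j inside [0, width)
            const int minj = std::max(M - w, 0);
            const int supj = std::min(width + M - w, N);
            // Sum over j in [minj, supj) of kernel[j] * src[w-M+j]
            dst[w] = std::inner_product(kernel.data() + minj, kernel.data() + supj,
                                        src + (w - M + minj), 0);
        }
    }
    return out;
}
```

With a real QImage, the row pointers would have to be obtained in a thread-safe way (for example before entering the parallel loop), since the non-const scanLine() can detach the image.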

Related

Optimizing the Vivado HLS code to reduce the latency for image processing algorithm

I am trying to implement an image processing algorithm for a gamut mapping filter in hardware using Vivado HLS. I have created a synthesizable version from Halide code, but it is far too slow: an image of 256x512 takes around 135 seconds, which shouldn't be the case. I have used some optimization techniques such as pipelining the innermost loop. I set a target initiation interval of II=1 for the innermost loop, but the achieved II is 6. From the compiler warnings I understand that this is caused by the accesses to arrays such as ctrl_pts and weights. From the tutorials I have seen, array partitioning and array reshaping should help with faster access to these arrays. The code I used for synthesis is below:
//header
#include "hls_stream.h"
#include <ap_fixed.h>
//#include <ap_int.h>
#include "ap_int.h"
typedef ap_ufixed<24,24> bit_24;
typedef ap_fixed<11,8> fix;
typedef unsigned char uc;
typedef ap_uint<24> stream_width;
//typedef hls::stream<uc> Stream_t;
typedef hls::stream<stream_width> Stream_t;
struct pixel_f
{
float r;
float g;
float b;
};
struct pixel_8
{
uc r;
uc g;
uc b;
};
void gamut_transform(int rows,int cols,Stream_t& in,Stream_t& out, float ctrl_pts[3702][3],float weights[3702][3],float coefs[4][3],float num_ctrl_pts);
//core
//include the header
#include "gamut_header.h"
#include "hls_math.h"
void gamut_transform(int rows,int cols, Stream_t& in,Stream_t& out, float ctrl_pts[3702][3],float weights[3702][3],float coefs[4][3],float num_ctrl_pts)
{
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
//#pragma HLS INTERFACE fifo port=out
#pragma HLS dataflow
pixel_8 input;
pixel_8 new_pix;
bit_24 temp_in,temp_out;
pixel_f buff_1,buff_2,buff_3,buff_4,buff_5;
float dist;
for (int i = 0; i < 256; i++)
{
for (int j = 0; i < 512; i++)
{
temp_in = in.read();
input.r = (temp_in & 0xFF0000)>>16;
input.g = (temp_in & 0x00FF00)>>8;
input.b = (temp_in & 0x0000FF);
buff_1.r = ((float)input.r)/256.0;
buff_1.g = ((float)input.g)/256.0;
buff_1.b = ((float)input.b)/256.0;
for(int idx =0; idx < 3702; idx++)
{
buff_2.r = buff_1.r - ctrl_pts[idx][0];
buff_2.g = buff_1.g - ctrl_pts[idx][1];
buff_2.b = buff_1.b - ctrl_pts[idx][2];
dist = sqrt((buff_2.r*buff_2.r)+(buff_2.g*buff_2.g)+(buff_2.b*buff_2.b));
buff_3.r = buff_2.r + (weights[idx][0] * dist);
buff_3.g = buff_2.g + (weights[idx][1] * dist);
buff_3.b = buff_2.b + (weights[idx][2] * dist);
}
buff_4.r = buff_3.r + coefs[0][0] + buff_1.r* coefs[1][0] + buff_1.g * coefs[2][0] + buff_1.b* coefs[3][0];
buff_4.g = buff_3.g + coefs[0][1] + buff_1.r* coefs[1][1] + buff_1.g * coefs[2][1] + buff_1.b* coefs[3][1];
buff_4.b = buff_3.b + coefs[0][2] + buff_1.r* coefs[1][2] + buff_1.g * coefs[2][2] + buff_1.b* coefs[3][2];
buff_5.r = fmin(fmax((float)buff_4.r, 0.0), 255.0);
buff_5.g = fmin(fmax((float)buff_4.g, 0.0), 255.0);
buff_5.b = fmin(fmax((float)buff_4.b, 0.0), 255.0);
new_pix.r = (uc)buff_4.r;
new_pix.g = (uc)buff_4.g;
new_pix.b = (uc)buff_4.b;
temp_out = ((uc)new_pix.r << 16 | (uc)new_pix.g << 8 | (uc)new_pix.b);
out<<temp_out;
}
}
}
Even with the achieved II=6, the time taken is around 6 seconds; the target is a run time in the millisecond range. I tried to pipeline the second innermost loop, but then I run out of resources on my board, because the loop inside it is being unrolled. I am using a Zynq UltraScale board, which has a fair amount of resources. Any suggestions on optimizing the code will be highly appreciated.
Also, can anyone suggest what type of interface would be best suited for ctrl_pts, weights and coefs? For reading the image I understand that a streaming interface helps, and for reading small values like the number of rows and columns AXI-Lite is preferred. Is there a type of interface that I can use for the mentioned variables so that it goes hand in hand with array partitioning and array reshaping?
Any suggestions will be highly appreciated,
Thanks in advance
Edit: I understand that a fixed-point representation can bring down the latency further, but my first goal is to get the best result with the floating-point representation and then analyze the performance with a fixed-point representation.
There are some steps you can take to optimize your design, but bear in mind that if you really need a floating-point square root operation, it will most likely carry a huge latency penalty (unless properly pipelined, of course).
Your code might have a typo in the second inner loop: the index should be j, right?
Data Locality
First off: ctrl_pts is read multiple times from main memory (I assume). Since it's reused 256x512 times, it would be better to store it in a local buffer on the FPGA (such as a BRAM, which can be inferred), like so:
for(int i =0; i < 3702; i++) {
for (int j = 0; j < 3; ++j) {
#pragma HLS PIPELINE II=1
ctrl_pts_local[i][j] = ctrl_pts[i][j];
}
}
for (int i = 0; i < 256; i++) {
for (int j = 0; j < 512; j++) {
// ...
buff_2.r = buff_1.r - ctrl_pts_local[idx][0];
// ...
Same reasoning goes for coefs and weights, just store them in a local variable before running the rest of the code.
To access the arguments you can use a master AXI4 interface m_axi and configure it accordingly. Once the algorithm is dealing with the local buffers, HLS should be able to automatically partition the buffers accordingly. If not, you can place the ARRAY_PARTITION complete dim=0 pragmas to force it.
Dataflow
Because of the way your algorithm works, another thing you could try is to break down the main loops (256x512) into three smaller processes running in dataflow, and so in parallel (+3 if you include the setup ones)
The whole code will look something like this (I hope it renders correctly):
[Compute buff_1]-->[FIFO1]-->[compute buff_3]-->[FIFO2a]-->[compute buff_4 and buff_5 + stream out]
L-------------------------------->[FIFO2b]----^
One tricky thing would be to stream buff_1 to both the next processes.
Possible Code
I won't try this code, so there might be compilation errors along the way, but the whole accelerator code would look something like this:
for(int i =0; i < 3702; i++) {
for (int j = 0; j < 3; ++j) {
#pragma HLS PIPELINE II=1
ctrl_pts_local[i][j] = ctrl_pts[i][j];
weights_local[i][j] = weights[i][j];
}
}
for(int i =0; i < 4; i++) {
for (int j = 0; j < 3; ++j) {
#pragma HLS PIPELINE II=1
coefs_local[i][j] = coefs[i][j];
}
}
Process_1:
for (int i = 0; i < 256; i++) {
for (int j = 0; j < 512; j++) {
#pragma HLS PIPELINE II=1
temp_in = in.read();
input.r = (temp_in & 0xFF0000)>>16;
input.g = (temp_in & 0x00FF00)>>8;
input.b = (temp_in & 0x0000FF);
buff_1.r = ((float)input.r)/256.0;
buff_1.g = ((float)input.g)/256.0;
buff_1.b = ((float)input.b)/256.0;
fifo_1.write(buff_1); // <--- WRITE TO FIFOs
fifo_2b.write(buff_1);
}
}
Process_2:
for (int i = 0; i < 256; i++) {
for (int j = 0; j < 512; j++) {
for(int idx =0; idx < 3702; idx++) {
#pragma HLS LOOP_FLATTEN // <-- It shouldn't be necessary, since the if statements already help
#pragma HLS PIPELINE II=1 // <-- The PIPELINE directive can go here
if (idx == 0) {
buff_1 = fifo_1.read(); // <--- READ FROM FIFO
}
buff_2.r = buff_1.r - ctrl_pts_local[idx][0];
buff_2.g = buff_1.g - ctrl_pts_local[idx][1];
buff_2.b = buff_1.b - ctrl_pts_local[idx][2];
dist = sqrt((buff_2.r*buff_2.r)+(buff_2.g*buff_2.g)+(buff_2.b*buff_2.b));
buff_3.r = buff_2.r + (weights_local[idx][0] * dist);
buff_3.g = buff_2.g + (weights_local[idx][1] * dist);
buff_3.b = buff_2.b + (weights_local[idx][2] * dist);
if (idx == 3702 - 1) {
fifo_2a.write(buff_3); // <-- WRITE TO FIFO
}
}
}
}
Process_3:
for (int i = 0; i < 256; i++) {
for (int j = 0; j < 512; j++) {
#pragma HLS PIPELINE II=1
buff_3 = fifo_2a.read(); // <--- READ FROM FIFO
buff_1 = fifo_2b.read(); // <--- READ FROM FIFO
buff_4.r = buff_3.r + coefs_local[0][0] + buff_1.r* coefs_local[1][0] + buff_1.g * coefs_local[2][0] + buff_1.b* coefs_local[3][0];
buff_4.g = buff_3.g + coefs_local[0][1] + buff_1.r* coefs_local[1][1] + buff_1.g * coefs_local[2][1] + buff_1.b* coefs_local[3][1];
buff_4.b = buff_3.b + coefs_local[0][2] + buff_1.r* coefs_local[1][2] + buff_1.g * coefs_local[2][2] + buff_1.b* coefs_local[3][2];
buff_5.r = fmin(fmax((float)buff_4.r, 0.0), 255.0);
buff_5.g = fmin(fmax((float)buff_4.g, 0.0), 255.0);
buff_5.b = fmin(fmax((float)buff_4.b, 0.0), 255.0);
new_pix.r = (uc)buff_4.r;
new_pix.g = (uc)buff_4.g;
new_pix.b = (uc)buff_4.b;
temp_out = ((uc)new_pix.r << 16 | (uc)new_pix.g << 8 | (uc)new_pix.b);
out<<temp_out;
}
}
Be extremely careful when sizing the depth of the FIFOs, since Process 2 (the one with the sqrt operation) might have slower data consumption and production rates! FIFO 2b also needs to account for that delay. If the rates do not match, there will be deadlocks. Make sure to have a meaningful testbench and to cosimulate your design.
(The FIFO's depth can be changed with the pragma #pragma HLS STREAM variable=fifo_1 depth=N).
Final Thoughts
There might be further smaller/detailed optimizations that can be performed along the way, but I would start with the ones above, as they have the biggest impact. Just bear in mind that floating-point processing is not optimal on FPGAs (as you noted) and is usually avoided.
EDITs: I tried the code with the modifications above and achieved II=1 with decent resource usage.
Since the II is now one, the ideal number of cycles the accelerator would take is 256x512x3702, and I'm close to that: ideal 485,228,544 versus mine 485,228,587. One crazy idea I now have to propose is to split the Process_2 innermost loop into two parallel branches (or even more than two), each feeding its own FIFO. Process_1 would supply the two branches, while an additional process/loop would alternately read the 256x512 elements from the two FIFOs and supply them in the correct order to Process_3. In this way, the total number of cycles required should halve, since Process_2 is the slowest process in the dataflow (and so improving it will improve the whole design). One possible drawback of this approach is a higher amount of area/resources required on the FPGA.
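A quick way to sanity-check such cycle budgets is sketched below. It assumes that every (pixel, control point) pair costs one pipelined iteration and ignores pipeline fill/drain latency; the function names and the clock frequency passed to seconds() are hypothetical and not taken from the design above.

```cpp
#include <cstdint>

// Ideal cycle count for a fully pipelined loop nest: one innermost
// iteration starts every `ii` cycles (fill/drain latency is ignored).
constexpr std::uint64_t idealCycles(std::uint64_t rows, std::uint64_t cols,
                                    std::uint64_t ctrlPts, std::uint64_t ii)
{
    return rows * cols * ctrlPts * ii;
}

// Convert a cycle count to seconds. The clock frequency is an assumption
// supplied by the caller; it depends on the actual board and design.
constexpr double seconds(std::uint64_t cycles, double clockHz)
{
    return static_cast<double>(cycles) / clockHz;
}
```

For example, idealCycles(256, 512, 3702, 1) evaluates to 485,228,544, and the same loop nest at II=6 needs six times as many cycles. Splitting the innermost loop across two parallel branches would roughly halve the figure.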
Good luck.

Per-pixel lookup-table

I was wondering if there is another way to implement a per-pixel lookup table in OpenCV. What I have now works okay at small resolutions (~4 frames per second) but is slow at high resolutions (less than 1 frame per second).
I have read that I can use CUDA, but I do not know how to use it. If it is the only way, can someone point me in the right direction? Thank you.
In header file:
Mat_<Vec<uchar,256>> _GB;
Mat_<Vec<uchar,256>> _GG;
Mat_<Vec<uchar,256>> _GR;
…
Mat VideoProcessor::CamCalib(Mat _tempFrame)
{
Mat_<Vec3b> _frame = _tempFrame;
for( int i = 0; i < _tempFrame.rows; ++i)
for( int j = 0; j < _tempFrame.cols; ++j )
{
_frame(i,j)[0] = _GB(i,j)[ _frame(i,j)[0] ];
_frame(i,j)[1] = _GG(i,j)[ _frame(i,j)[1] ];
_frame(i,j)[2] = _GR(i,j)[ _frame(i,j)[2] ];
}
_tempFrame = _frame;
return _tempFrame;
}
Thanks Jerome, I followed your suggestion. A simple implementation with a 25% increase in performance. For me, since I am doing real-time image capture with software lens correction, memory is not as costly as speed.
So here is what I added:
Added to the .pro file:
QMAKE_CXXFLAGS += -openmp
Added to the .cpp file:
#include "omp.h"
...
and added just before the for loop in the method:
#pragma omp parallel for
Did some fine tuning. (Added a before-and-after correction image at the end of the post.)
(1) Assuming the spherical lens shape is identical in all 4 quadrants, I was able to further increase the frame rate to 9 fps at 1920x1080 resolution.
(2) Forced fewer threads, reducing CPU usage from 100% to 80%.
(3) Less memory used: the lookup table is 1/4 of the original size.
Note: must be in "release" mode for OpenMP to function correctly! Again, thank you Jerome!
Mat VideoProcessor::CamCalib(Mat _tempFrame)
{
Mat_<Vec3b> _frame = _tempFrame;
int nrows = _tempFrame.rows-1;
int ncols = _tempFrame.cols-1;
#pragma omp parallel for num_threads(2)
for( int i = 0; i < _tempFrame.rows/2; ++i)
for( int j = 0; j < _tempFrame.cols/2; ++j )
{
_frame(i,j)[0] = _GB(i,j)[_frame(i,j)[0]];
_frame(i,j)[1] = _GG(i,j)[_frame(i,j)[1]];
_frame(i,j)[2] = _GR(i,j)[_frame(i,j)[2]];
_frame(nrows-i,j)[0] = _GB(i,j)[_frame(nrows-i,j)[0]];
_frame(nrows-i,j)[1] = _GG(i,j)[_frame(nrows-i,j)[1]];
_frame(nrows-i,j)[2] = _GR(i,j)[_frame(nrows-i,j)[2]];
_frame(i,ncols-j)[0] = _GB(i,j)[_frame(i,ncols-j)[0]];
_frame(i,ncols-j)[1] = _GG(i,j)[_frame(i,ncols-j)[1]];
_frame(i,ncols-j)[2] = _GR(i,j)[_frame(i,ncols-j)[2]];
_frame(nrows-i,ncols-j)[0] = _GB(i,j)[_frame(nrows-i,ncols-j)[0]];
_frame(nrows-i,ncols-j)[1] = _GG(i,j)[_frame(nrows-i,ncols-j)[1]];
_frame(nrows-i,ncols-j)[2] = _GR(i,j)[_frame(nrows-i,ncols-j)[2]];
}
_tempFrame = _frame;
return _tempFrame;
}
Before and After correction
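The quadrant trick can also be written by folding each pixel coordinate back into the top-left quadrant, which touches every pixel exactly once and also covers the middle row/column when a dimension is odd. The sketch below is an alternative formulation, not the code above: Image, QuarterLut and applyMirroredLut are hypothetical names, it uses a single channel and plain vectors instead of cv::Mat, and it assumes the table holds ceil(rows/2) x ceil(cols/2) entries of 256 values each.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical single-channel image (stand-in for one channel of a cv::Mat).
struct Image {
    int rows, cols;
    std::vector<std::uint8_t> px;   // rows * cols values, row-major
    std::uint8_t& at(int r, int c) { return px[static_cast<std::size_t>(r) * cols + c]; }
};

// Hypothetical quarter-size lookup table: one 256-entry table per pixel of
// the top-left quadrant. Assumed size: ceil(image rows / 2) x ceil(image cols / 2).
struct QuarterLut {
    int rows, cols;
    std::vector<std::uint8_t> table;   // rows * cols * 256 entries
    std::uint8_t map(int r, int c, std::uint8_t v) const {
        return table[(static_cast<std::size_t>(r) * cols + c) * 256 + v];
    }
};

void applyMirroredLut(Image& img, const QuarterLut& lut)
{
    // Rows are independent, so they can be split across threads;
    // the pragma is ignored when OpenMP is not enabled.
    #pragma omp parallel for
    for (int r = 0; r < img.rows; ++r) {
        // Fold the row index into the top half
        const int lr = r < lut.rows ? r : img.rows - 1 - r;
        for (int c = 0; c < img.cols; ++c) {
            // Fold the column index into the left half
            const int lc = c < lut.cols ? c : img.cols - 1 - c;
            img.at(r, c) = lut.map(lr, lc, img.at(r, c));
        }
    }
}
```

Whether this is faster than updating four mirrored pixels per iteration depends on cache behaviour and would need to be measured.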

_kmp huge overhead and spin time for unknown calls in OpenMP?

I'm using Intel VTune to analyze my parallel application.
As you can see, there is a huge Spin Time at the beginning of the application (represented as the orange section on the left side):
It's more than 28% of the application's duration (which is roughly 0.14 seconds)!
As you can see, these functions are _clone, start_thread, _kmp_launch_thread and _kmp_fork_barrier. They look like OpenMP internals or system calls, but it's not specified where these functions are called from.
In addition, if we zoom in at the beginning of this section, we can notice a region instantiation, represented by the selected region:
However, I never call initInterTab2d and I have no idea whether it's called by one of the libraries that I'm using (especially OpenCV).
Digging deeper and running an Advanced Hotspot analysis, I found out a little more about the first unknown functions:
And expanding the Function/Call Stack tab:
But again, I can't really understand why these functions are called, why they take so long and why only the master thread works during them, while the others are in a "barrier" state.
If you're interested, this is the link to part of the code.
Notice that I have only one #pragma omp parallel region, which is the selected section of this image (on the right side):
The code structure is the following:
1. Compute some serial, non-parallelizable stuff. In particular, compute a chain of blurs, which is represented by gaussianBlur (included at the end of the code). cv::GaussianBlur is an OpenCV function which exploits IPP.
2. Start the parallel region, where 3 parallel for loops are used:
- The first one calls hessianResponse.
- A single thread adds the results to a shared vector.
- The second parallel region (localfindAffineShapeArgs) generates the data used by the next parallel region. The two regions can't be merged because of load imbalance.
- The third region generates the final result in a balanced way.
Note: according to the lock analysis of VTune, the critical and barrier sections are not the reason for the spinning.
This is the main function of the code:
void HessianDetector::detectPyramidKeypoints(const Mat &image, cv::Mat &descriptors, const AffineShapeParams ap, const SIFTDescriptorParams sp)
{
float curSigma = 0.5f;
float pixelDistance = 1.0f;
cv::Mat octaveLayer;
// prepare first octave input image
if (par.initialSigma > curSigma)
{
float sigma = sqrt(par.initialSigma * par.initialSigma - curSigma * curSigma);
octaveLayer = gaussianBlur(image, sigma);
}
// while there is sufficient size of image
int minSize = 2 * par.border + 2;
int rowsCounter = image.rows;
int colsCounter = image.cols;
float sigmaStep = pow(2.0f, 1.0f / (float) par.numberOfScales);
int levels = 0;
while (rowsCounter > minSize && colsCounter > minSize){
rowsCounter/=2; colsCounter/=2;
levels++;
}
int scaleCycles = par.numberOfScales+2;
//-------------------Shared Vectors-------------------
std::vector<Mat> blurs (scaleCycles*levels+1, Mat());
std::vector<Mat> hessResps (levels*scaleCycles+2); //+2 because high needs an extra one
std::vector<Wrapper> localWrappers;
std::vector<FindAffineShapeArgs> findAffineShapeArgs;
localWrappers.reserve(levels*(scaleCycles-2));
vector<float> pixelDistances;
pixelDistances.reserve(levels);
for(int i=0; i<levels; i++){
pixelDistances.push_back(pixelDistance);
pixelDistance*=2;
}
//compute blurs at all layers (not parallelizable)
for(int i=0; i<levels; i++){
blurs[i*scaleCycles+1] = octaveLayer.clone();
for (int j = 1; j < scaleCycles; j++){
float sigma = par.sigmas[j]* sqrt(sigmaStep * sigmaStep - 1.0f);
blurs[j+1+i*scaleCycles] = gaussianBlur(blurs[j+i*scaleCycles], sigma);
if(j == par.numberOfScales)
octaveLayer = halfImage(blurs[j+1+i*scaleCycles]);
}
}
#pragma omp parallel
{
//compute all the hessianResponses
#pragma omp for collapse(2) schedule(dynamic)
for(int i=0; i<levels; i++)
for (int j = 1; j <= scaleCycles; j++)
{
int scaleCyclesLevel = scaleCycles * i;
float curSigma = par.sigmas[j];
hessResps[j+scaleCyclesLevel] = hessianResponse(blurs[j+scaleCyclesLevel], curSigma*curSigma);
}
//we need to allocate here localWrappers to keep alive the reference for FindAffineShapeArgs
#pragma omp single
{
for(int i=0; i<levels; i++)
for (int j = 2; j < scaleCycles; j++){
int scaleCyclesLevel = scaleCycles * i;
localWrappers.push_back(Wrapper(sp, ap, hessResps[j+scaleCyclesLevel-1], hessResps[j+scaleCyclesLevel], hessResps[j+scaleCyclesLevel+1],
blurs[j+scaleCyclesLevel-1], blurs[j+scaleCyclesLevel]));
}
}
std::vector<FindAffineShapeArgs> localfindAffineShapeArgs;
#pragma omp for collapse(2) schedule(dynamic) nowait
for(int i=0; i<levels; i++)
for (int j = 2; j < scaleCycles; j++){
size_t c = (scaleCycles-2) * i +j-2;
//toDo: octaveMap is shared, need synchronization
//if(j==1)
// octaveMap = Mat::zeros(blurs[scaleCyclesLevel+1].rows, blurs[scaleCyclesLevel+1].cols, CV_8UC1);
float curSigma = par.sigmas[j];
// find keypoints in this part of octave for curLevel
findLevelKeypoints(curSigma, pixelDistances[i], localWrappers[c]);
localfindAffineShapeArgs.insert(localfindAffineShapeArgs.end(), localWrappers[c].findAffineShapeArgs.begin(), localWrappers[c].findAffineShapeArgs.end());
}
#pragma omp critical
{
findAffineShapeArgs.insert(findAffineShapeArgs.end(), localfindAffineShapeArgs.begin(), localfindAffineShapeArgs.end());
}
#pragma omp barrier
std::vector<Result> localRes;
#pragma omp for schedule(dynamic) nowait
for(int i=0; i<findAffineShapeArgs.size(); i++){
hessianKeypointCallback->onHessianKeypointDetected(findAffineShapeArgs[i], localRes);
}
#pragma omp critical
{
for(size_t i=0; i<localRes.size(); i++)
descriptors.push_back(localRes[i].descriptor);
}
}
}
Mat gaussianBlur(const Mat input, const float sigma)
{
Mat ret(input.rows, input.cols, input.type());
int size = (int)(2.0 * 3.0 * sigma + 1.0); if (size % 2 == 0) size++;
GaussianBlur(input, ret, Size(size, size), sigma, sigma, BORDER_REPLICATE);
return ret;
}
If you consider a one-time cost of 50 ms (a fraction of the blink of an eye) to be a huge overhead, then you should probably focus on your workflow as such. Try to use one fully initialized process (with its threads and data structures) in a persistent way to increase the work done during each run.
That said, it may be possible to reduce the overhead, but in any case you will be very dependent on the runtime and initialization cost of your library, thus limiting your performance portability.
Your performance analysis may also be problematic. AFAIK VTune uses sampling, and your data indicates a 1 ms sampling interval. That means you may have just 50 samples during the critical initialization path of your application, too few for a confident analysis. VTune might also have some form of OpenMP instrumentation that provides more accurate results at small time scales. In any case, I would take any performance measurement over just 150 ms with a grain of salt unless I knew exactly what impact and method the measurement has.
P.S. Running a simple code like:
#include <stdio.h>
#include <omp.h>
int main() {
double start = omp_get_wtime();
#pragma omp parallel
{
#pragma omp barrier
#pragma omp master
printf("%f s\n", omp_get_wtime() - start);
}
}
This shows an initial thread creation overhead between 3 ms and 200 ms on different systems / thread counts with the Intel OpenMP runtime.
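To see how much of that cost is truly one-time, one could time several empty parallel regions back to back. The sketch below is an illustration under assumptions: timeParallelRegions is a hypothetical helper, and it uses std::chrono rather than omp_get_wtime so that it also builds without OpenMP (in which case the pragma is ignored and the timings are meaningless as a thread start-up measure).

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: time `repeats` empty parallel regions back to back
// and return the wall-clock seconds spent in each one.
std::vector<double> timeParallelRegions(std::size_t repeats)
{
    using clock = std::chrono::steady_clock;
    std::vector<double> secs;
    secs.reserve(repeats);
    for (std::size_t r = 0; r < repeats; ++r) {
        const auto start = clock::now();
        #pragma omp parallel
        {
            // Empty body: the implicit barrier at the end of the region
            // means the measurement includes thread start-up or wake-up.
        }
        secs.push_back(std::chrono::duration<double>(clock::now() - start).count());
    }
    return secs;
}
```

With common OpenMP runtimes, which keep their worker threads alive between regions, the first entry would typically be much larger than the following ones. That is the effect a persistent, already initialized process takes advantage of.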

OpenMP seg fault (Android NDK) when trying to parallelize nested loops

I'm trying to use OpenMP in my Android NDK project. I have a program where I receive frames from the camera and carry out processing on 2D matrices. I would like to parallelize the accessing and updating of each pixel value. I have the following code, where mat is the 2D matrix local to a processFrame function; however, this causes a segmentation fault:
unsigned y,x;
#pragma omp parallel shared(mat) private(y,x)
{
#pragma omp for schedule(dynamic,1) collapse(2)
//For each row
for (y = 0U; y < _height; ++y){
// For each column
for ( x = 0U; x < _width; ++x){
mat.at<Vec3b>(y, x) = Vec3b(0, 0, 0); // Background
}
}
}
At the start of the program, I am able to initialize the global matrix using OpenMP, as such:
#pragma omp parallel for schedule(dynamic,1) collapse(2)
for (unsigned y = 0U; y < height; ++y)
{
// For each column
for (unsigned x = 0U; x < width; ++x)
{
// For each channel
#pragma omp parallel for
for (unsigned i = 0U; i < 3U; ++i)
pixels[x][y].bgr[i] = pixels[x][y].mean_bgr[i] = 0.0F;
pixels[x][y].scalar = pixels[x][y].pot = pixels[x][y].mean_pot = pixels[x][y].std_pot = 0.0F;
}
}
The above code seems to work fine. Note: I make a call to omp_set_nested(1); in my main method. I'm not sure what I'm doing wrong?
UPDATE:
I think the issue lies with the Java Native Interface. Since my processFrame() function is a JNI function call, the JVM limits all processing within to the main thread only (I think). I tried wrapping the processing - i.e. have a native function called by the JNI function, which uses a clone of the Mat image frame variable, but this also fails. I think it may not be possible to use OpenMP in this context. So my new strategy is multi-thread from the Java side - split the Mat image frame into blocks, and process each in a separate thread. I'm not sure if there is an easy-to-use OpenMP equivalent library in Java, but I'll look into Java concurrency. Thanks for your input!

OpenMP/C++ How to increment a variable in parallel for?

I have one problem, a big problem.
I have two images (using GDI+) and I want to compare them pixel by pixel.
When pixelA == pixelB, the variable cont should be incremented.
Right now I am comparing two identical images, so the result should be 100%, but I get 70%.
Why? How can I resolve this?
See:
#pragma omp parallel for schedule(dynamic)
for (int x = 0; x < height; x++){
for (int y = 0; y < width; y++){
int luma01 = 0, luma02 = 0;
Gdiplus::Color pixelColorImage01;
Gdiplus::Color pixelColorImage02;
myImage01->GetPixel(x, y, &pixelColorImage01);
luma01 = pixelColorImage01.GetRed() + pixelColorImage01.GetGreen() + pixelColorImage01.GetBlue();
myImage02->GetPixel(x, y, &pixelColorImage02);
luma02 = pixelColorImage02.GetRed() + pixelColorImage02.GetGreen() + pixelColorImage02.GetBlue();
#pragma omp critical
if (luma01 == luma02){
cont++;
}
}
}
percentage of equality between images
thanks =)
Before you parallelize your solution make sure you can solve it sequentially. In this case that means comment out the #pragma and debug that first.
First,
for (int x = 0; x < height; x++){
for (int y = 0; y < width; y++){
...
myImage01->GetPixel(x, y, &pixelColorImage01);
You transposed width and height, so you'll get a wrong answer for any image that's not square.
Second, your pixel equality metric is subject to collisions. Since you add up the individual colors' luminosities then compare that sum, it will think that, for example, an all red pixel is equal to an all blue one.
Do something like this instead:
if (red1 == red2 && green1 == green2 && blue1 == blue2)
cont++;
As for your parallelization, it's technically correct but will give you terrible performance. You put a critical section around the if, which means all the workers are constantly trying to acquire that lock. In other words, you've got parallel workers, but each one has to wait for all the others: you've serialized your parallel code. To solve this problem, look up OpenMP reductions (the reduction clause).
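The advice above can be sketched as follows, assuming the pixels have already been copied out of the images into plain arrays. The Rgb struct and countEqualPixels are hypothetical names, and no GDI+ calls are involved. The comparison is done per channel, so a pure red pixel and a pure blue pixel (same channel sum, different color) are not counted as equal.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical plain pixel type; the real code would fill it from the images.
struct Rgb { std::uint8_t r, g, b; };

// Counts the positions at which both images hold exactly the same color.
long countEqualPixels(const std::vector<Rgb>& a, const std::vector<Rgb>& b)
{
    const long n = static_cast<long>(a.size() < b.size() ? a.size() : b.size());
    long cont = 0;
    // Each thread accumulates into its own private copy of cont; OpenMP adds
    // the copies together when the loop ends, so no critical section is needed.
    #pragma omp parallel for reduction(+:cont)
    for (long i = 0; i < n; ++i) {
        if (a[i].r == b[i].r && a[i].g == b[i].g && a[i].b == b[i].b)
            ++cont;
    }
    return cont;
}
```

The percentage of equality is then 100.0 * cont / n for n compared pixels.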
Thanks for your answers!
Adam, I swapped the matrix indices, and now I compare the individual colors.
Now I have two small problems.
When I run the code:
see result of my code: http://postimg.org/image/8c1ophkz9/
Using #pragma omp parallel for schedule(dynamic,1024) reduction(+:cont)
Parallel time (0.763 seconds) (100% of similarity)
Sequential time (0.702 seconds) (100% of similarity)
Using #pragma omp parallel for schedule(dynamic) reduction(+:cont)
Parallel time (0.113 seconds) (66% of similarity)
Sequential time (0.703 seconds) (100% of similarity)
Image1 is equal to image2.
I still get many collisions in my code.
I need to reduce the time of the parallel code. I don't understand why the parallel code is slower than the sequential code.
Parallel code
#pragma omp parallel for schedule(dynamic) reduction(+:cont)
for (int x = 0; x < width; x++){
for (int y = 0; y < height; y++){
Gdiplus::Color pixelColorImage01;
Gdiplus::Color pixelColorImage02;
myImage01->GetPixel(x, y, &pixelColorImage01);
myImage02->GetPixel(x, y, &pixelColorImage02);
cont += (pixelColorImage01.GetRed() == pixelColorImage02.GetRed() && pixelColorImage01.GetGreen() == pixelColorImage02.GetGreen() && pixelColorImage01.GetBlue() == pixelColorImage02.GetBlue());
}
}
Sequential code
for (int x = 0; x < width; x++){
for (int y = 0; y < height; y++){
Gdiplus::Color pixelColorImage01;
Gdiplus::Color pixelColorImage02;
myImage01->GetPixel(x, y, &pixelColorImage01);
myImage02->GetPixel(x, y, &pixelColorImage02);
cont += (pixelColorImage01.GetRed() == pixelColorImage02.GetRed() && pixelColorImage01.GetGreen() == pixelColorImage02.GetGreen() && pixelColorImage01.GetBlue() == pixelColorImage02.GetBlue());
}
}