OpenMP/C++ How to increment a variable in parallel for? - c++

I have one problem, big problem =.
I Have two image (using GDIplus) and I want to compare pixel-pixel.
when pixelA = pixelB, the variable cont should be incremented.
today, I compare two equal image, my return should be 100%, but this return is 70%.
why? how can i resolve this?
see
#pragma omp parallel for schedule(dynamic)
for (int x = 0; x < height; x++){
for (int y = 0; y < width; y++){
int luma01 = 0, luma02 = 0;
Gdiplus::Color pixelColorImage01;
Gdiplus::Color pixelColorImage02;
myImage01->GetPixel(x, y, &pixelColorImage01);
luma01 = pixelColorImage01.GetRed() + pixelColorImage01.GetGreen() + pixelColorImage01.GetBlue();
myImage02->GetPixel(x, y, &pixelColorImage02);
luma02 = pixelColorImage02.GetRed() + pixelColorImage02.GetGreen() + pixelColorImage02.GetBlue();
#pragma omp critical
if (luma01 == luma02){
cont++;
}
}
}
percentage of equality between images
thanks =)

Before you parallelize your solution make sure you can solve it sequentially. In this case that means comment out the #pragma and debug that first.
First,
for (int x = 0; x < height; x++){
for (int y = 0; y < width; y++){
...
myImage01->GetPixel(x, y, &pixelColorImage01);
You transposed width and height, so you'll get a wrong answer for any image that's not square.
Second, your pixel equality metric is subject to collisions. Since you add up the individual colors' luminosities then compare that sum, it will think that, for example, an all red pixel is equal to an all blue one.
Do something like this instead:
if (red1 == red2 && green1 == green2 && blue1 == blue2)
cont++;
As for your parallelization, it's technically correct but will give you terrible performance. You put a critical section around the if, so that means if all the workers are constantly trying to acquire that lock. In other words, you've got parallel workers but each one has to wait for all the others. In other words, you've serialized your parallel code. To solve this problem look up OpenMP reducers.

I'm happy for your answers!
Adam, i switched the position of matriz. and now, I compare individual the colors!
Now, i have two little problems.
when I ran code:
see result of my code: http://postimg.org/image/8c1ophkz9/
Using #pragma omp parallel for schedule(dynamic,1024) reduction(+:cont)
Parallel time (0.763 seconds) (100% of similarity)
Sequential time (0.702 seconds) (100% of similarity)
Using #pragma omp parallel for schedule(dynamic) reduction(+:cont)
Parallel time (0.113 seconds) (66% of similarity)
Sequential time (0.703 seconds) (100% of similarity)
Image1 is equal image2.
I have many colision yet in my code =\
i need reduce time in parallel code. i dont understand why o parallel code is more slow than sequential code =\
Parallel code
#pragma omp parallel for schedule(dynamic) reduction(+:cont)
for (int x = 0; x < width; x++){
for (int y = 0; y < height; y++){
Gdiplus::Color pixelColorImage01;
Gdiplus::Color pixelColorImage02;
myImage01->GetPixel(x, y, &pixelColorImage01);
myImage02->GetPixel(x, y, &pixelColorImage02);
cont += (pixelColorImage01.GetRed() == pixelColorImage02.GetRed() && pixelColorImage01.GetGreen() == pixelColorImage02.GetGreen() && pixelColorImage01.GetBlue() == pixelColorImage02.GetBlue());
}
}
Sequential code
for (int x = 0; x < width; x++){
for (int y = 0; y < height; y++){
Gdiplus::Color pixelColorImage01;
Gdiplus::Color pixelColorImage02;
myImage01->GetPixel(x, y, &pixelColorImage01);
myImage02->GetPixel(x, y, &pixelColorImage02);
cont += (pixelColorImage01.GetRed() == pixelColorImage02.GetRed() && pixelColorImage01.GetGreen() == pixelColorImage02.GetGreen() && pixelColorImage01.GetBlue() == pixelColorImage02.GetBlue());
}
}

Related

Multi-threading molecular simulations in C++

I am developing a molecular dynamics simulation code in C++, which essentially takes atom positions and other properties as input and simulates their motion under Newton's laws of motion. The core algorithm uses what's called the Velocity Verlet scheme and looks like:
// iterate through time (k=[1,#steps])
double Dt = 0.002; // time step
double Ttot = 1.0; // total time
double halfDt = Dt/2.0;
for (int k = 1; k*Dt <= Ttot; k++){
for (int i = 0; i < number_particles; i++)
vHalf[i] = p[i].velocity + F[i]*halfDt; // step 1
for (int i = 0; i < number_particles; i++)
p[i].position += vHalf[i]*Dt; // step 2
for (int i = 0; i < number_particles; i++)
F[i] = Force(p,i); // recalculate force on all particle i's
for (int i = 0; i < number_particles; i++)
p[i].velocity = vHalf[i] + F[i]*halfDt; // step 3
}
Where p is an array of class objects which store things like particle position, velocity, mass, etc. and Force is a function that calculates the net force on a particle using something like Lennard-Jones potential.
My question regards the time required to complete the calculation; all of my subroutines are optimized in terms of crunching numbers (e.g. using x*x*x to raise to the third power instead of pow(x,3)), but the main issue is the time loop will often be performed for millions of iterations and there are typically close to a million particles. Is there any way to implement this algorithm using multi-threading? From my understanding, multi-threading essentially opens another stream of data to and from a CPU core, which would allow me to run two different simulations at the same time; I would like to use multi-threading to make just one of these simulations run faster
I'd recommend using OpenMP.
Your specific use case is trivially parallelizable.
Prallelization should be as simple as:
double Dt = 0.002; // time step
double Ttot = 1.0; // total time
double halfDt = Dt/2.0;
for (int k = 1; k*Dt <= Ttot; k++){
#pragma omp parallel for
for (int i = 0; i < number_particles; i++)
vHalf[i] = p[i].velocity + F[i]*halfDt; // step 1
p[i].position += vHalf[i]*Dt; // step 2
#pragma omp parallel for
for (int i = 0; i < number_particles; i++)
F[i] = Force(p,i); // recalculate force on all particle i's
p[i].velocity = vHalf[i] + F[i]*halfDt; // step 3
}
Most popular compilers and platforms have support for OpenMP.

How to create a Minecraft chunk in opengl?

I'm learning OpenGL and I have tried to make a voxel game like Minecraft. In the beginning, everything was good. I created a CubeRenderer class to render a cube with its position. The below picture is what I have done.
https://imgur.com/yqn783x
And then I got a serious problem when I try to create a large terrain, I hit a slowing performance. It was very slow and fps just around 15fps, I thought.
Next, I figured out Minecraft chunk and culling face algorithm can solve slowing performance by dividing the world map into small pieces like chunk and just rendering visible faces of a cube. So how to create a chunk in the right way and how the culling face algorithm is applied in Chunk?
So far, that is what I have tried
I read about Chunk in Minecraft at https://minecraft.gamepedia.com/Chunk
I created a demo Chunk by this below code (it is not the completed code because I removed it out)
I created a CubeData that contains cube position and cube type.
And I call the GenerateTerrain function to make a simple chunk data (16x16x16) like below (CHUNK_SIZE is 16)
for (int x = 0; x < CHUNK_SIZE; x++) {
for (int y = 0; y < CHUNK_SIZE; y++) {
for (int z = 0; z < CHUNK_SIZE; z++) {
CubeType cubeType = { GRASS_BLOCK };
Location cubeLocation = { x, y, z };
CubeData cubeData = { cubeLocation, cubeType };
this->Cubes[x][y][z] = cubeData;
}
}
}
After that, I had a boolean array which is called "mask" contains two values are 0 (not visible) or 1 (visible) and matches with their cube data. And then I call Render function of Chunk class to render a chunk. This code below like what I have done (but it is not complete code because I removed that code and replaced with new code)
for (int x = 0; x < CHUNK_SIZE; x++) {
for (int y = 0; y < CHUNK_SIZE; y++) {
for (int z = 0; z < CHUNK_SIZE; z++) {
for(int side = 0; side < 6;side++){
if(this->mask[x][y][z][side] == true) cubeRenderer.Render(cubeData[x][y][z]);
}
}
}
}
But the result I got that everything still slow (but it is better than the first fps, from 15fps up to 25-30fps, maybe)
I guess it is not gpu problem, it is a cpu problem because there is too many loops in render call.
So I have kept research because I think my approach was wrong. There may have some right way to create a chunk, right?
So I found the solution that puts every visible verticle to one VBO. So I just have to call and bind VBO definitely one time.
So this below code show what I have tried
cout << "Generating Terrain..." << endl;
for (int side = 0; side < 6; side++) {
for (int x = 0; x < CHUNK_SIZE; x++) {
for (int y = 0; y < CHUNK_SIZE; y++) {
for (int z = 0; z < CHUNK_SIZE; z++) {
if (this->isVisibleSide(x, y, z, side) == true) {
this->cubeRenderer.AddVerticleToVBO(this->getCubeSide(side), glm::vec3(x, y, z), this->getTexCoord(this->Cubes[x][y][z].cubeType, side));
}
}
}
}
}
this->cubeRenderer.GenerateVBO();
And call render one time at all.
void CubeChunk::Update()
{
this->cubeRenderer.Render(); // with VBO data have already init above
}
And I got this:
https://imgur.com/YqsrtPP
I think my way was wrong.
So what should I do to create a chunk? Any suggestion?

Qimage setPixel with openmp parallel for doesn't work

The code works without parallelism, but when I add pragma omp parallel, it doesn't work. Furthermore, the code works perfectly with pragma omp parallel if I don't add setPixel. So, I would like to know why the parallelism doesn't work properly and exits the program with code 255 when I try to set pixel in the new image. This code wants to change an image doing two loops to change every pixel using a Gauss vector. If something can't be understood I'll solve it inmediately.
for (h = 0; h < height; h++){
QRgb* row = (QRgb*) result->scanLine(h);
//#pragma omp parallel for schedule(dynamic) num_threads(cores) private (j, auxazul, auxrojo, auxverde) reduction(+:red,green,blue)
for (w = 0; w < width; w++) {
red=green=blue=0;
minj = max((M-w),0);
supj = min((width+M-w),N);
for (j=minj; j<supj; j++){
auxazul = azul [w-M+j][h];
auxrojo = rojo [w-M+j][h];
auxverde = verde [w-M+j][h];
red += vectorGauss[j]*auxrojo;
green += vectorGauss[j]*auxverde;
blue += vectorGauss[j]*auxazul;
}
red /= 256; green /= 256; blue /= 256;
//result->setPixel(w,h,QColor(red,green,blue).rgba());
row[w] = QColor(red,green,blue).rgba();
}
QImage::setPixel is not thread safe, since it calls the detach() method (have a look at the official documentation here). Remember QImage uses implicit sharing.
Besides, setPixel() is extremely slow. If you are seeking performance (as someone usually do when dealing with parallel implementations), that's not the best way to go.
Using scanLine() as you already do in the example provided is the correct way of doing it.
Beside the comment that setPixel is slow and not thread safe, you currently have a race condition when writing the result
row[w] = QColor(red,green,blue).rgba();
Your code is slow in the first place because you are accessing your color matrices in a memory inefficient way. Pumping threads will make this part worse. Given that you loop on each scanline, you would like to have the transposee of your color matrices. Which allow you to do :
for (h = 0; h < height; h++){
QRgb* row = (QRgb*) result->scanLine(h);
auto azulscan = azul [h];
auto rojoscan = rojo [h];
auto verdescan = verde [h];
for (w = 0; w < width; w++) {
red=green=blue=0;
minj = max((M-w),0);
supj = min((width+M-w),N);
for (j=minj; j<supj; j++){
auto auxazul = azulscan [w-M+j];
auto auxrojo = rojoscan [w-M+j];
auto auxverde = verdescan [w-M+j];
red += vectorGauss[j]*auxrojo;
green += vectorGauss[j]*auxverde;
blue += vectorGauss[j]*auxazul;
}
row[w] = QColor(red,green,blue).rgba();
}
}
I dont know openmp well but you want to have a single thread per scanline, so your parallel loop need to be above the first loop. Something like
#pragma omp parallel for whatever
for (h = 0; h < height; h++){
QRgb* row;
#pragma omp critical
{
row = = (QRgb*) result->scanLine(h);
}
....
}
Another point. You can use std::inner_product to compute the color value in a single line once you have the transpose of the color inputs.
green = std::inner_product(&vectorGauss[minj], &vectorGauss[supj-1]+1, &verdescan[w-M+jmin], &verdescan[w-M+supj]+1)

_kmp huge overhead and spin time for unkown calls in OpenMP?

I'm using Intel VTune to analyze my parallel application.
As you can see, there is an huge Spin Time at the beginning of the application (represented as the orange section on the left side):
It's more than 28% of the application durations (which is roughly 0.14 seconds)!
As you can see, these functions are _clone, start_thread, _kmp_launch_thread and _kmp_fork_barrier and they look like OpenMP internals or system calls, but it's not specified where these fucntion are called from.
In addition, if we zoom at the beginning of this section, we can notice a region instantiation, represented by the selected region:
However, I never call initInterTab2d and I have no idea if it's called by some of the labraries that I'm using (especially OpenCV).
Digging deeply and running an Advanced Hotspot analysis I found a little bit more about the firsts unkown functions:
And exaplanding tthe Function/Call Stack tab:
But again, I can't really understand why these functions, why they take so long and why only the master thread works during them, while the others are in a "barrier" state.
If you're interested, this is the link to part of the code.
Notice that I have only one #pragma omp parallel region, which is the selected section of this image (on the right side):
The code structure is the following:
Compute some serial, non parallelizable stuff. In particular, compute a chain of blurs, which is represented by gaussianBlur (included at the end of the code). cv::GaussianBlur is an OpenCV function which exploits IPP.
Start the parallel region, where 3 parallel for are used
The first one calls hessianResponse
A single thread add the results to a shared vector.
The second parallel region localfindAffineShapeArgs generates the data used by the next parallel region. The two regions can't be merged because of load imbalance.
The third region generates the final result in a balanced way.
Note: according to the lock analysis of VTune, the critical and barrier sections are not the reason of spinning.
This is the main function of the code:
void HessianDetector::detectPyramidKeypoints(const Mat &image, cv::Mat &descriptors, const AffineShapeParams ap, const SIFTDescriptorParams sp)
{
float curSigma = 0.5f;
float pixelDistance = 1.0f;
cv::Mat octaveLayer;
// prepare first octave input image
if (par.initialSigma > curSigma)
{
float sigma = sqrt(par.initialSigma * par.initialSigma - curSigma * curSigma);
octaveLayer = gaussianBlur(image, sigma);
}
// while there is sufficient size of image
int minSize = 2 * par.border + 2;
int rowsCounter = image.rows;
int colsCounter = image.cols;
float sigmaStep = pow(2.0f, 1.0f / (float) par.numberOfScales);
int levels = 0;
while (rowsCounter > minSize && colsCounter > minSize){
rowsCounter/=2; colsCounter/=2;
levels++;
}
int scaleCycles = par.numberOfScales+2;
//-------------------Shared Vectors-------------------
std::vector<Mat> blurs (scaleCycles*levels+1, Mat());
std::vector<Mat> hessResps (levels*scaleCycles+2); //+2 because high needs an extra one
std::vector<Wrapper> localWrappers;
std::vector<FindAffineShapeArgs> findAffineShapeArgs;
localWrappers.reserve(levels*(scaleCycles-2));
vector<float> pixelDistances;
pixelDistances.reserve(levels);
for(int i=0; i<levels; i++){
pixelDistances.push_back(pixelDistance);
pixelDistance*=2;
}
//compute blurs at all layers (not parallelizable)
for(int i=0; i<levels; i++){
blurs[i*scaleCycles+1] = octaveLayer.clone();
for (int j = 1; j < scaleCycles; j++){
float sigma = par.sigmas[j]* sqrt(sigmaStep * sigmaStep - 1.0f);
blurs[j+1+i*scaleCycles] = gaussianBlur(blurs[j+i*scaleCycles], sigma);
if(j == par.numberOfScales)
octaveLayer = halfImage(blurs[j+1+i*scaleCycles]);
}
}
#pragma omp parallel
{
//compute all the hessianResponses
#pragma omp for collapse(2) schedule(dynamic)
for(int i=0; i<levels; i++)
for (int j = 1; j <= scaleCycles; j++)
{
int scaleCyclesLevel = scaleCycles * i;
float curSigma = par.sigmas[j];
hessResps[j+scaleCyclesLevel] = hessianResponse(blurs[j+scaleCyclesLevel], curSigma*curSigma);
}
//we need to allocate here localWrappers to keep alive the reference for FindAffineShapeArgs
#pragma omp single
{
for(int i=0; i<levels; i++)
for (int j = 2; j < scaleCycles; j++){
int scaleCyclesLevel = scaleCycles * i;
localWrappers.push_back(Wrapper(sp, ap, hessResps[j+scaleCyclesLevel-1], hessResps[j+scaleCyclesLevel], hessResps[j+scaleCyclesLevel+1],
blurs[j+scaleCyclesLevel-1], blurs[j+scaleCyclesLevel]));
}
}
std::vector<FindAffineShapeArgs> localfindAffineShapeArgs;
#pragma omp for collapse(2) schedule(dynamic) nowait
for(int i=0; i<levels; i++)
for (int j = 2; j < scaleCycles; j++){
size_t c = (scaleCycles-2) * i +j-2;
//toDo: octaveMap is shared, need synchronization
//if(j==1)
// octaveMap = Mat::zeros(blurs[scaleCyclesLevel+1].rows, blurs[scaleCyclesLevel+1].cols, CV_8UC1);
float curSigma = par.sigmas[j];
// find keypoints in this part of octave for curLevel
findLevelKeypoints(curSigma, pixelDistances[i], localWrappers[c]);
localfindAffineShapeArgs.insert(localfindAffineShapeArgs.end(), localWrappers[c].findAffineShapeArgs.begin(), localWrappers[c].findAffineShapeArgs.end());
}
#pragma omp critical
{
findAffineShapeArgs.insert(findAffineShapeArgs.end(), localfindAffineShapeArgs.begin(), localfindAffineShapeArgs.end());
}
#pragma omp barrier
std::vector<Result> localRes;
#pragma omp for schedule(dynamic) nowait
for(int i=0; i<findAffineShapeArgs.size(); i++){
hessianKeypointCallback->onHessianKeypointDetected(findAffineShapeArgs[i], localRes);
}
#pragma omp critical
{
for(size_t i=0; i<localRes.size(); i++)
descriptors.push_back(localRes[i].descriptor);
}
}
Mat gaussianBlur(const Mat input, const float sigma)
{
Mat ret(input.rows, input.cols, input.type());
int size = (int)(2.0 * 3.0 * sigma + 1.0); if (size % 2 == 0) size++;
GaussianBlur(input, ret, Size(size, size), sigma, sigma, BORDER_REPLICATE);
return ret;
}
If you consider a 50 ms (a fraction of the blink of an eye) one time cost to be a huge overhead, then you should probably focus on your workflow as such. Try to use one fully initialized process (with it's threads and data structures) in a persistent way to increase the work done during each each run.
That said, it may be possible to reduce the overhead, but in any case you will be very dependent on the runtime and initialization cost of your library, thus limiting your performance portability.
Your performance analysis may also be problematic. AFAIK VTune uses sampling, your data indicates a 1 ms sampling interval. That means you may have just 50 samples during the critical initialization path of your application, too little for a confident analysis. VTune might also have some forms of OpenMP instrumentation that provides more accurate results at small time scales. In any case I would take any performance measurement over just 150 ms with a grain of salt unless I knew exactly what impact and method the measurement has.
P.S. Running a simple code like:
#include <stdio.h>
#include <omp.h>
int main() {
double start = omp_get_wtime();
#pragma omp parallel
{
#pragma omp barrier
#pragma omp master
printf("%f s\n", omp_get_wtime() - start);
}
}
Shows an initial thread creation overhead between 3 ms and 200 ms on different systems / thread counts with the Intel OpenMP runtime.

Getting the position of all pixels that have a value bigger than X openCV

I'm looking for an efficiant way to get the posistion of all pixels that has a value bigger that let's say (100,100,100). I know I could put use minmaxLoc to get the position of the max and min but a want to add an average to it to I can get them all, and I don't want to use a while loop.
thanks in advance
You could use a pseudocode like the following:
//process the loop in a multithread way with openmp
#pragma omp parallel for
int xy=0;
for(int x = 0; x <= x_resolution; ++x) {
for(int y = 0; y <= y_resolution, ++y) {
if(value of Point bigger than (100,100,100) {
do_what_you_want;
}
++xy;
}
}