Threads are slow in C++

I'm trying to draw a Mandelbrot set and want to use 4 threads to do the calculation at the same time, each on a different part of the image. Here are the functions:
void Mandelbrot(int x_min,int x_max,int y_min,int y_max,Image &im)
{
for (int i = y_min; i < y_max; i++)
{
for (int j = x_min; j < x_max; j++)
{
//scaled x and y cordinate
double x0 = mape(j, 0, W, MinX, MaxX);
double y0 = mape(i, 0, H, MinY, MaxY);
double x = 0.0f;
double y = 0.0f;
int iteration = 0;
double z = 0;
while (abs(z)<2.0f && iteration < maxIteration)// && iteration < maxIteration)
{
double xtemp = x * x - y * y + x0;
y = 2 * x * y + y0;
x = xtemp;
iteration++;
z = x * x + y * y;
if (z > 10)//must be 10
break;
}
int b =mape(iteration, 0, maxIteration, 0, 255);
if (iteration == maxIteration)
b = 0;
im.setPixel(j, i, Color(b,b,0));
}
}
}
The mape function just converts a number from one range to another.
Here is the thread function:
void th(Image& im)
{
float size = (float)im.getSize().x / num_th;
int x_min = 0, x_max = size, y_min = 0, y_max = im.getSize().y;
thread t[num_th];
for (size_t i = 0; i < num_th; i++)
{
t[i] = thread(Mandelbrot, x_min, x_max, y_min, y_max, ref(im));
x_min = x_max;
x_max += size;
}
for (size_t i = 0; i<num_th; i++)
{
t[i].join();
}
}
The main function looks like this
int main()
{
Image img;
while(1)//here is while window.open()
{
th(img);
//here im drawing
}
}
So I am not getting any performance boost; it even gets slower. Can anyone tell me where the problem is and what I'm doing wrong? It has happened to me before, too.
I saw a question asking what Image is: it's a class from the SFML library. I don't know if this is of any help.

Your code is too incomplete for a concrete answer, but here are a few suspects:
Thread start-up cost. Spawning a thread has non-trivial overhead. If the amount of work performed by the thread is not large enough, the overhead of launching it may cost more than any gains you would get through parallelism.
Excessive locking and contention. This does not look like a problem in your code, as you don't seem to use any locks at all. Be careful, though: running without locks is only correct as long as the threads don't write to the same addresses.
False sharing. A possible problem in your code. Cache lines tend to be 64 bytes. A write to any part of a cache line invalidates every other core's copy of the whole line. So if two threads are working on the same cache line and one of them writes to it, the others have to re-fetch the line even if they only use a different part of it. This can cause significant slowdowns when multiple threads work on non-overlapping data that happens to share a cache line. If the threads iterate at the same rate through adjacent data, the invalidations recur over and over. The effect can be significant and is always worth considering.
Memory layout causing your cache to be thrashed. While walking through an array, going "across" may align with the actual memory layout, reading one full cache line after another, but scanning "vertically" touches one portion of a cache line and then jumps to the corresponding portion of another cache line. If this happens in many threads and you have a lot of memory to churn through, it can mean that your cache is vastly underutilized. Be aware of whether your data is laid out in row-major or column-major order, write code to match it, and avoid jumping around in memory.
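As a rough illustration of the last two points, here is a minimal sketch that gives each thread a contiguous band of rows in a plain row-major buffer, so threads only come near each other's cache lines at band boundaries. It is not the asker's code: the names mandelIterations, renderRows and renderParallel are hypothetical, the coordinate ranges [-2, 1] x [-1, 1] are assumed, and the result goes into a std::vector<int> rather than an SFML Image (the pixels would be copied into the image afterwards).

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Iteration count for one point c = (x0, y0); escapes once |z|^2 > 4.
int mandelIterations(double x0, double y0, int maxIter) {
    double x = 0.0, y = 0.0;
    int it = 0;
    while (x * x + y * y <= 4.0 && it < maxIter) {
        const double xt = x * x - y * y + x0;
        y = 2.0 * x * y + y0;
        x = xt;
        ++it;
    }
    return it;
}

// Fill rows [rowBegin, rowEnd) of a row-major width*height buffer.
// The coordinate mapping is an assumption made for this sketch.
void renderRows(std::vector<int>& out, int width, int height,
                int rowBegin, int rowEnd, int maxIter) {
    for (int i = rowBegin; i < rowEnd; ++i) {
        const double y0 = -1.0 + 2.0 * i / height;
        for (int j = 0; j < width; ++j) {
            const double x0 = -2.0 + 3.0 * j / width;
            out[static_cast<std::size_t>(i) * width + j] =
                mandelIterations(x0, y0, maxIter);
        }
    }
}

// Split the image into contiguous row bands, one band per thread.
std::vector<int> renderParallel(int width, int height, int maxIter, int numThreads) {
    std::vector<int> out(static_cast<std::size_t>(width) * height);
    std::vector<std::thread> workers;
    const int band = (height + numThreads - 1) / numThreads;
    for (int t = 0; t < numThreads; ++t) {
        const int begin = std::min(height, t * band);
        const int end = std::min(height, begin + band);
        if (begin < end)
            workers.emplace_back(renderRows, std::ref(out), width, height,
                                 begin, end, maxIter);
    }
    for (auto& w : workers) w.join();
    return out;
}
```

Calling renderParallel(800, 600, 255, 4) and renderParallel(800, 600, 255, 1) should give identical buffers, which makes it easy to time both and see whether the threads help at all. It is also worth timing an optimized build, since thread start-up cost weighs more when each frame does little work.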

Related

Function started with std::async crashes after quite a few iterations

I am trying to develop a simple evolution algorithm in C++. To make my calculations faster I decided to use async functions to run multiple calculations at once:
std::vector<std::future<int> > compute(8);
unsigned nptr = 0;
int syncp = 0;
while(nptr != network::networks.size()){
compute.at(syncp) = std::async(&network::analyse, &network::networks.at(nptr), data, width, height, sw, dFnum.at(idx));
syncp++;
if(syncp == 8){
syncp = 0;
for(unsigned i = 0; i < 8; i++){
compute.at(i).get();
}
}
nptr++;
}
This is how I start my calculating function. The function is called analyse, and for each "network" it assigns a score depending on how well it identifies the image.
This is part of the analyse function:
for(unsigned i = 0; i < entry.size(); i++){
double sum = 0;
data * d = &entry.at(i);
pattern * p = &pattern::patterns.at(d->patNo);
int sx = iWidth;
int sy = iHeight;
if(d->xPercentage*iWidth + d->xSpan*iWidth < sx) sx = d->xPercentage*iWidth + d->xSpan*iWidth;
if(d->yPercentage*iHeight + d->xSpan*iWidth < sy) sy = d->yPercentage*iHeight + d->xSpan*iWidth;
int xdisp = sx-d->xPercentage*iWidth;
int ydisp = sy-d->yPercentage*iHeight;
for(int x = d->xPercentage*iWidth; x < sx; x++){
for(int y = d->yPercentage*iHeight; y < sy; y++){
double xpl = x-d->xPercentage*iWidth;
double ypl = y-d->yPercentage*iHeight;
xpl /= xdisp;
ypl /= ydisp;
unsigned idx = (unsigned)(xpl*(p->width) + ypl*(p->height)*(p->width));
if(idx >= p->lweight.size()) idx = p->lweight.size()-1;
double weight = p->lweight.at(idx) - 5;
if(imageData[y*iWidth+x])
sum += weight;
else
sum -= 2*weight;
}
}
digitWeight[d->digit-1] += sum;
}
}
Now, there is no need to analyse the function itself - I'm sure it works, I have tested it on a single thread, and it runs just fine. The only problem is, after some time of execution, I get errors like segmentation fault, or vector range check error.
They mostly happen at this line:
digitWeight[d->digit-1] += sum;
Now, you can be sure that d->digit-1 is a valid range for this array.
The problem is that the value of the d pointer is different than it was here:
data * d = &entry.at(i);
It magically changes during the execution of the function, and starts pointing to different data, leading to errors. I have tried saving the value of d->digit to some variable and later use this variable, and it worked fine for just a while longer, before crashing on another shared resource, imageData this time.
I'm thinking this might be related to data sharing: all async functions share the same array of data, which is a static vector. But this data is only read, not written anywhere, so why would it stop working? I know of something called mutex locking, but it would make no sense to lock these async functions, as the program would then run just as slowly as a single-threaded one.
I have also tried running the functions like this:
std::vector<std::thread*> threads(8);
unsigned nptr = 0;
int threadp = 0;
while(nptr != network::networks.size()){
threads.at(threadp) = new std::thread(&network::analyse, &network::networks.at(nptr), data, width, height, sw, dFnum.at(idx));
threadp++;
if(threadp == 8){
threadp = 0;
for(unsigned i = 0; i < 8; i++){
if(threads.at(i)->joinable()) threads.at(i)->join();
delete threads.at(i);
}
}
nptr++;
}
and it did work for a second, but after some time a very similar error appeared.
Data is a structure containing 7 integers, one of which is an ID of
pattern, and pattern is a class that contains two integers - width and height
and vector of chars.
Why does it happen on read-only data and how can I prevent it?
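For comparison, here is a minimal sketch of the batching pattern in which the shared input really is read-only and each task writes nothing but its own return value. It is not a diagnosis of the crash: scoreNetwork is a hypothetical stand-in for network::analyse, and scoreAll is likewise made up for illustration. Two things differ from the loop in the question: the launch policy is given explicitly as std::launch::async, and a final partial batch is drained as well. Whether digitWeight is shared between tasks is not visible in the question; if it is, the += on it would be a write to shared data and not covered by the "read-only" assumption.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Hypothetical stand-in for network::analyse: only reads the shared input
// and reports its result through the return value.
int scoreNetwork(const std::vector<int>& shared, int networkId) {
    long sum = 0;
    for (int v : shared) sum += static_cast<long>(v) * (networkId + 1);
    return static_cast<int>(sum);
}

// Launch tasks in batches of batchSize and collect every result in order.
std::vector<int> scoreAll(const std::vector<int>& shared, int networkCount,
                          std::size_t batchSize) {
    std::vector<int> scores;
    scores.reserve(static_cast<std::size_t>(networkCount));
    std::vector<std::future<int>> batch;
    for (int id = 0; id < networkCount; ++id) {
        // std::cref passes the input by reference instead of copying it.
        batch.push_back(std::async(std::launch::async, scoreNetwork,
                                   std::cref(shared), id));
        // Drain when the batch is full, and also after the last task.
        if (batch.size() == batchSize || id + 1 == networkCount) {
            for (auto& f : batch) scores.push_back(f.get());
            batch.clear();
        }
    }
    return scores;
}
```

With this shape, any state a task modifies lives on its own stack or in its return value, so no mutex is needed for the shared input.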
Here is an example of what happens:

Ineffective "Peel/Remainder" Loop in my code

I have this function:
bool interpolate(const Mat &im, float ofsx, float ofsy, float a11, float a12, float a21, float a22, Mat &res)
{
bool ret = false;
// input size (-1 for the safe bilinear interpolation)
const int width = im.cols-1;
const int height = im.rows-1;
// output size
const int halfWidth = res.cols >> 1;
const int halfHeight = res.rows >> 1;
float *out = res.ptr<float>(0);
const float *imptr = im.ptr<float>(0);
for (int j=-halfHeight; j<=halfHeight; ++j)
{
const float rx = ofsx + j * a12;
const float ry = ofsy + j * a22;
#pragma omp simd
for(int i=-halfWidth; i<=halfWidth; ++i, out++)
{
float wx = rx + i * a11;
float wy = ry + i * a21;
const int x = (int) floor(wx);
const int y = (int) floor(wy);
if (x >= 0 && y >= 0 && x < width && y < height)
{
// compute weights
wx -= x; wy -= y;
int rowOffset = y*im.cols;
int rowOffset1 = (y+1)*im.cols;
// bilinear interpolation
*out =
(1.0f - wy) * ((1.0f - wx) * imptr[rowOffset+x] + wx * imptr[rowOffset+x+1]) +
( wy) * ((1.0f - wx) * imptr[rowOffset1+x] + wx * imptr[rowOffset1+x+1]);
} else {
*out = 0;
ret = true; // touching boundary of the input
}
}
}
return ret;
}
halfWidth is very random: it can be 9, 84, 20, 95, 111...I'm only trying to optimize this code, I don't understand it in details.
As you can see, the inner for has been already vectorized, but Intel Advisor suggests this:
And this is the Trip Count analysis result:
To my understanding this means that:
Vector length is 8, so 8 floats can be processed at the same time in each loop iteration. This would mean (if I'm not wrong) that the data is 32-byte aligned (even though, as I explain here, it seems that the compiler thinks that the data is not aligned).
On average, 2 cycles are totally vectorized, while 3 cycles are remainder loops. The same goes for Min and Max. Otherwise I don't understand what ; means.
Now my question is: how can I follow Intel Advisor first suggestion? It says to "increase the size of objects and add iterations so the trip count is a multiple of vector length"...Ok, so it's simply sayin' "hey man do this so halfWidth*2+1 (since it goes from -halfWidth to +halfWidth is a multiple of 8)". But how can I do this? If I add random cycles, this would obviously break the algorithm!
The only solution that came to my mind is to add "fake" iterations like this:
const int vectorLength = 8;
const int iterations = halfWidth*2+1;
const int remainder = iterations%vectorLength;
for(int i=0; i<loop+length-remainder; i++){
//this iteration was not supposed to exist, skip it!
if(i>halfWidth)
continue;
}
Of course this code would not work since it goes from -halfWidth to halfWidth, but it's to make you understand my strategy of "fake" iterations.
About the second option ("Increase the size of static and automatic objects, and use a compiler option to add data padding") I have no idea how to implement this.
First, check the Vector Efficiency metric in Advisor, as well as the time spent in the loop remainder relative to the loop body (see the hotspots list in Advisor). If efficiency is close to 100% (or the time spent in the remainder is very small), then it is not worth the effort (or the money, as MSalters mentioned in the comments).
If it is well below 100% (and the tool reports no other penalties), you can either refactor the code to add "fake" iterations (few users can afford that) or try #pragma loop_count with the most typical iteration counts (depending on the typical halfWidth value).
If halfWidth is totally random (no common or average values), then there is nothing you can really do about this issue.
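To make the "fake iterations" option concrete, here is a minimal sketch that rounds the trip count up to a multiple of the vector length and lets the extra iterations write into padding at the end of the buffer, instead of skipping them with a branch. The helper names roundUpToMultiple and rowCoordinates are hypothetical, and the loop body only computes the wx coordinate from the question, not the full interpolation. In the real function the output rows would need the same padding, and the padded iterations must stay safe to execute.

```cpp
#include <cstddef>
#include <vector>

// Round n up to the next multiple of vectorLength.
std::size_t roundUpToMultiple(std::size_t n, std::size_t vectorLength) {
    return (n + vectorLength - 1) / vectorLength * vectorLength;
}

// Computes wx = rx + i * a11 for i in [-halfWidth, halfWidth], plus a few
// extra "fake" iterations so the trip count is a multiple of vectorLength.
std::vector<float> rowCoordinates(int halfWidth, float rx, float a11,
                                  std::size_t vectorLength) {
    const std::size_t real = static_cast<std::size_t>(2 * halfWidth + 1);
    const std::size_t padded = roundUpToMultiple(real, vectorLength);
    std::vector<float> out(padded);
    // A vectorization hint such as "#pragma omp simd" would go on this loop.
    for (std::size_t k = 0; k < padded; ++k) {
        const int i = static_cast<int>(k) - halfWidth;
        out[k] = rx + static_cast<float>(i) * a11;
    }
    return out; // callers read only the first 2*halfWidth+1 entries
}
```

For halfWidth = 9 the real trip count is 19 and the padded one is 24, so at most vectorLength - 1 iterations are wasted per row. Whether that beats the remainder loop is something only a measurement can tell.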

How to fast calculate the normalized l1 and l2 norm of a vector in C++?

I have a matrix X that has n column data vectors in d dimensional space.
Given a vector xj, v[j] is its l1 norm (the sum of all abs(xji)), w[j] is the square of its l2 norm (the sum of all xji^2), and pj[i] is the combination of the entries divided by the l1 and l2 norms. Finally, I need the outputs pj, v and w for subsequent applications.
// X = new double [d*n]; is the input.
double alpha = 0.5;
double *pj = new double[d];
double *x_abs = new double[d];
double *x_2 = new double[d];
double *v = new double[n]();
double *w = new double[n]();
for (unsigned long j=0; j<n; ++j) {
jm = j*m;
jd = j*d;
for (unsigned long i=0; i<d; ++i) {
x_abs[i] = abs(X[i+jd]);
v[j] += x_abs[i];
x_2[i] = x_abs[i]*x_abs[i];
w[j] += x_2[i];
}
for (unsigned long i=0; i<d; ++i){
pj[i] = alpha*x_abs[i]/v[j]+(1-alpha)*x_2[i]/w[j];
}
// functionA(pj){ ... ...} for subsequent applications
}
// functionB(v, w){ ... ...} for subsequent applications
My algorithm above takes O(nd) flops. Can anyone help me speed it up, either with built-in functions or with a new implementation in C++? Reducing the constant factor in O(nd) would also be very helpful.
Let me guess: since you have performance problems, the dimension of your vectors is quite large. If this is the case, then it is worth considering CPU cache locality; there is some interesting information on this in a CppCon 2014 presentation.
If the data is not available in the CPU caches, then the cost of taking its absolute value or squaring it once it is available is dwarfed by the time the CPU spends just waiting for the data.
With this in mind, you may want to try the following solution (with no warranty that it will improve performance; the compiler may already apply these techniques when optimizing the code):
for (unsigned long j=0; j<n; ++j) {
    // use pointer arithmetic - at > -O0 the compiler will do it anyway
    double *start=X+j*d, *end=X+(j+1)*d;
    // this part avoids, as much as possible, the competition
    // on CPU caches between X and v/w.
    // Don't store the norms in v/w as yet, keep them in registers
    double l1norm=0, l2norm=0;
    for(double *src=start; src!=end; src++) {
        double val=*src;
        l1norm+=std::fabs(val); // needs <cmath>; use the value, not the pointer
        l2norm+=val*val;
    }
    // one division per norm per vector, matching
    // pj[i] = alpha*|x|/v[j] + (1-alpha)*x^2/w[j]
    double pl1=alpha/l1norm, pl2=(1-alpha)/l2norm;
    for(double *src=start, *dst=pj; src!=end; src++, dst++) {
        // Yes, recomputing abs/sqr may actually save time by not
        // creating competition on CPU caches with x_abs and x_2
        double val=*src;
        *dst = pl1*std::fabs(val) + pl2*val*val;
    }
    // functionA(pj){ ... ...} for subsequent applications
    // Think well if you really need v/w. If you really do,
    // at least there are only two values to be sent for storage into memory,
    // meanwhile the CPU can actually load the next vector into cache
    v[j]=l1norm; w[j]=l2norm;
}
// functionB(v, w){ ... ...} for subsequent applications

writing slower than the operation itself?

I am struggling to understand behavior of my functions.
My code is written in C++ in Visual Studio 2012, running on Windows 7 64-bit. I am working with 2D arrays of float numbers. When I time my function, I see that its running time drops by 10x or more if I just stop writing my results to the output pointer. Does that mean that writing is slow?
Here is an example:
void TestSpeed(float** pInput, float** pOutput)
{
UINT32 y, x, i, j;
for (y = 3; y < 100-3; y++)
{
for (x = 3; x < 100-3; x++)
{
float fSum = 0;
for (i = y - 3; i <= y+3; i++)
{
for (j = x-3; j <= x+3; j++)
{
fSum += pInput[y][x]*exp(-(pInput[y][x]-pInput[i][j])*(pInput[y][x]-pInput[i][j]));
}
}
pOutput[y][x] = fSum;
}
}
}
If I comment out the line "pOutput[y][x] = fSum;" then the functions runs very quick. Why is that?
I am calling 2-3 such functions sequentially. Would it help to use stack instead of heap to write chunk of results and passing it onto next function and then write back to heap buffer after that chunk is ready?
In some cases I saw that if I replace pOutput[y][x] by a line buffer allocated on stack like,
float fResult[100] and use it to store results works faster for larger data size.
Your code performs a lot of operations, and that takes time. Depending on what you do with the output, you may consider a diagonalization or decomposition of your input matrix. Or you can look for values in your output that are n times another value, and so on, and skip calculating the exponential for those.
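One thing worth ruling out before concluding that the store itself is slow: in an optimized build, a compiler is allowed to drop a computation whose result is never used, so with the store commented out the loops may do far less work or none at all. The sketch below is one way to keep the measurement fair. It is an assumption-laden rewrite, not the asker's function: filter7x7 is a hypothetical name, the data is held in flat row-major std::vector buffers instead of float**, and the function returns a checksum so that the arithmetic stays observable.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Same 7x7 neighbourhood sum as TestSpeed, on a flat row-major size*size
// buffer. Returns the sum of all written outputs as a checksum.
float filter7x7(const std::vector<float>& in, std::vector<float>& out, int size) {
    float checksum = 0.0f;
    for (int y = 3; y < size - 3; ++y) {
        for (int x = 3; x < size - 3; ++x) {
            const float c = in[static_cast<std::size_t>(y) * size + x];
            float sum = 0.0f;
            for (int i = y - 3; i <= y + 3; ++i) {
                for (int j = x - 3; j <= x + 3; ++j) {
                    const float d = c - in[static_cast<std::size_t>(i) * size + j];
                    sum += c * std::exp(-d * d);
                }
            }
            out[static_cast<std::size_t>(y) * size + x] = sum;
            checksum += sum;
        }
    }
    return checksum;
}
```

Printing or otherwise using the returned checksum after the timed region keeps the compiler from discarding the loops. If the timings with and without the store are then close, the 10x difference came from eliminated work, not from slow writes.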

munmap_chunk() - Invalid pointer error

I'm writing a renderer using low-level SDL functions to learn how it all works. I am now trying to do polygon drawing, but I run into errors possibly due to my inexperience with C++. When running the code I get a munmap_chunk() - Invalid pointer error. Searching reveals that it is most likely due to free()-ing the memory twice. The error happens when returning from the function. I realize that the error comes from automatically free()ing memory which has been automatically free()d before, but I'm not experienced enough with C++ to spot the error. Any clues?
My code:
void DrawPolygon (const vector<vec3> & verts, vec3 color){
// 0. Project to the screen
vector<ivec2> vertices(verts.size());
for(int i = 0; i < verts.size(); i++){
VertexShader(verts.at(i), vertices.at(i));
}
// 1. Find max and min y-value of the polygon
// and compute the number of rows it occupies.
int miny = vertices[0].y;
int maxy = vertices[0].y;
for (int i = 1; i < 3; i++){
if (vertices[i].y < miny){
miny = vertices[i].y;
}
if (vertices[i].y > maxy){
maxy = vertices[i].y;
}
}
int rows = abs(maxy - miny) + 1;
// 2. Resize leftPixels and rightPixels
// so that they have an element for each row.
vector<ivec2> leftPixels(rows);
vector<ivec2> rightPixels(rows);
// 3. Initialize the x-coordinates in leftPixels
// to some really large value and the x-coordinates
// in rightPixels to some really small value.
for(int i = 0; i < rows; i++){
leftPixels[i].x = std::numeric_limits<int>::max();
rightPixels[i].x = std::numeric_limits<int>::min();
leftPixels[i].y = miny + i;
rightPixels[i].y = miny + i;
}
// 4. Loop through all edges of the polygon and use
// linear interpolation to find the x-coordinate for
// each row it occupies. Update the corresponding
// values in rightPixels and leftPixels.
for(int i = 0; i < 3; i++){
ivec2 a = vertices[i];
ivec2 b = vertices[(i+1)%3];
// find the number of pixels to draw
ivec2 delta = glm::abs(a - b);
int pixels = glm::max(delta.x, delta.y) + 1;
// interpolate to find the pixels
vector<ivec2> line (pixels);
Interpolate(a, b, line);
for(int j = 0; j < pixels; j++){
ivec2 p = line[j];
ivec2 cmpl = leftPixels[p.y - miny];
ivec2 cmpr = rightPixels[p.y - miny];
if(p.x < cmpl.x){
leftPixels[p.y - miny].x = p.x;
//leftPixels[p.y - miny] = cmpl;
}
if(p.x > cmpr.x){
rightPixels[p.y - miny].x = p.x;
//cmpr.x = p.x;
//rightPixels[p.y - miny] = cmpr;
}
}
}
for(int i = 0; i < leftPixels.size(); i++){
ivec2 l = leftPixels.at(i);
ivec2 r = rightPixels.at(i);
// y coord the same, iterate over x
int y = l.y;
for(int x = l.x; x <= r.x; x++){
PutPixelSDL(screen, x, y, color);
}
}
}
Using valgrind gives me this output (this is the first error it reports). Weirdly, the program recovers and keeps running with the expected result, apparently not getting the same error again:
==5706== Invalid write of size 4
==5706== at 0x40AD61: DrawPolygon(std::vector<glm::detail::tvec3<float>, std::allocator<glm::detail::tvec3<float> > > const&, glm::detail::tvec3<float>) (in /home/actimia/prog/dgi14/lab3/ThirdLab)
==5706== by 0x409C78: Draw() (in /home/actimia/prog/dgi14/lab3/ThirdLab)
==5706== by 0x409668: main (in /home/actimia/prog/dgi14/lab3/ThirdLab)
I think my previous post on similar topic would be useful.
https://stackoverflow.com/a/22658693/2724703
From your Valgrind report, it looks like your program is corrupting memory through an overflow (an out-of-bounds write). This does not look like a "double free" error. You mentioned that Valgrind does not always report the error, which makes the problem harder to track down. However, there is certainly memory corruption, and you must fix it. Memory errors sometimes occur only intermittently, for various reasons (different input parameters, multithreading, a changed execution sequence).
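One way to hunt for such an overflow in the posted function is to check every computed index before writing through it. In DrawPolygon the row index p.y - miny is used unchecked, and miny and maxy are taken from the first three vertices only, so a point outside that range would index past the end of leftPixels and rightPixels. Whether that is the write Valgrind reports cannot be told from the question. The sketch below is a hypothetical helper (updateSpan, with Pixel standing in for glm::ivec2) that refuses the write instead of corrupting memory, and it assumes both vectors have the same size.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for glm::ivec2.
struct Pixel { int x; int y; };

// Widen the [left, right] span of the row that contains p.
// Returns false, without writing, when p.y lies outside the allocated rows.
// Assumes left and right have the same number of rows.
bool updateSpan(std::vector<Pixel>& left, std::vector<Pixel>& right,
                int miny, const Pixel& p) {
    const long row = static_cast<long>(p.y) - miny;
    if (row < 0 || row >= static_cast<long>(left.size())) return false;
    const std::size_t r = static_cast<std::size_t>(row);
    if (p.x < left[r].x) left[r].x = p.x;
    if (p.x > right[r].x) right[r].x = p.x;
    return true;
}
```

Logging or asserting on a false return shows immediately which edge and which point produce the bad row. Replacing operator[] with at() in the original loops gives a similar effect by throwing std::out_of_range.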