Efficiently Building a Summed Area Table - C++

I am trying to construct a summed area table for later use in an adaptive thresholding routine. Since this code is going to be used in time critical software, I am trying to squeeze as many cycles as possible out of it.
For performance, the table stores an unsigned integer for every pixel.
When I attach my profiler, it shows that my largest performance bottleneck occurs when performing the x-pass.
The simple math expression for the computation is:
sat_[y * width + x] = sat_[y * width + x - 1] + buff_[y * width + x]
where the running sum resets at every new y position.
In this case, sat_ is a 1-D pointer of unsigned integers representing the SAT, and buff_ is an 8-bit unsigned monochrome buffer.
My implementation looks like the following:
uint *pSat = sat_;
unsigned char *pBuff = buff_;   // buff_ is 8-bit unsigned, so use unsigned char
for (size_t y = 0; y < height; ++y, pSat += width, pBuff += width)
{
    uint curr = 0;
    for (uint x = 0; x < width; x += 4)
    {
        pSat[x + 0] = curr += pBuff[x + 0];
        pSat[x + 1] = curr += pBuff[x + 1];
        pSat[x + 2] = curr += pBuff[x + 2];
        pSat[x + 3] = curr += pBuff[x + 3];
    }
}
The loop is unrolled manually because my compiler (VC11) didn't do it for me. The problem I have is that the entire segmentation routine spends an extraordinary amount of time just running through that loop, and I am wondering if anyone has any thoughts on what might speed it up. I have access to all of the SSE instruction sets and AVX on any machine this routine will run on, so if there is something there, that would be extremely useful.
Also, once I squeeze out the last cycles, I then plan on extending this to multi-core, but I want to get the single thread computation as tight as possible before I make the model more complex.

You have a dependency chain running along each row; each result depends on the previous one. So you cannot vectorise/parallelise in that direction.
But it sounds like each row is independent of all the others, so you can vectorise/parallelise by computing multiple rows simultaneously. You'd need to transpose your arrays in order to allow the vector instructions to access neighbouring elements in memory.*
However, that creates a problem. Walking along rows would now be absolutely terrible from a cache point of view (every iteration would be a cache miss). The way to solve this is to interchange the loop order.
Note, though, that each element is read precisely once. And you're doing very little computation per element. So you'll basically be limited by main-memory bandwidth well before you hit 100% CPU usage.
* This restriction may be lifted in AVX2, I'm not sure...
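To make the transposition idea above concrete, here is a rough sketch of the transposed, loop-interchanged x-pass using SSE4.1 (which the questioner has available). The names satT_ and buffT_, the column-major layout (index = x * height + y), and the assumption that height is a multiple of 4 are all illustrative, not part of the original code; the dependency runs along the outer x loop, so the inner y loop has no dependency and vectorises cleanly.
#include <smmintrin.h>   // SSE4.1 for _mm_cvtepu8_epi32
#include <cstring>

void x_pass_transposed(unsigned int *satT_, const unsigned char *buffT_,
                       size_t width, size_t height)
{
    // column x == 0: the running sum is just the pixel value
    for (size_t y = 0; y < height; ++y)
        satT_[y] = buffT_[y];

    for (size_t x = 1; x < width; ++x)
    {
        const unsigned int  *prev = satT_  + (x - 1) * height;
        unsigned int        *curr = satT_  + x * height;
        const unsigned char *src  = buffT_ + x * height;

        for (size_t y = 0; y < height; y += 4)
        {
            int packed;                            // 4 source bytes for rows y..y+3
            std::memcpy(&packed, src + y, 4);
            __m128i px = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(packed)); // widen to 4 x 32-bit
            __m128i p  = _mm_loadu_si128((const __m128i *)(prev + y)); // sums from column x-1
            _mm_storeu_si128((__m128i *)(curr + y), _mm_add_epi32(p, px));
        }
    }
}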

Algorithmically, I don't think there is anything you can do to optimize this further. Even though you didn't use the term OLAP cube in your description, you are basically just building an OLAP cube. The code you have is the standard approach to building an OLAP cube.
If you give details about the hardware you're working with, there might be some optimizations available. For example, there is a GPU programming approach that may or may not be faster. Note: another post on this thread mentioned that parallelization is not possible. This isn't necessarily true... The algorithm as you've written it can't run in parallel, but there are prefix-sum formulations that retain data-level parallelism, which could be exploited with a GPU approach.

Related

Openmp nested for loop with ordered output

I'm currently trying to find a fast and reliable way to parallelize a set of loops with if conditions where I need to save a result in the inner loop.
The code is supposed to go through a huge number of points in a 3D grid. For some points within this volume I have to check another condition (checking for an angle), and if this condition is fulfilled I have to calculate a density.
The fastest ways so far were #pragma omp parallel for private(x,y,z) collapse(3) outside of all for loops, or #pragma omp parallel for on the innermost loop (phiInd), which is not only the largest loop but also calls a CPU-intensive function.
I need to store the density value in densityarr within the inner loop. The density array is then later saved separately.
My problem now is that, depending on the number of threads I set, I get different results in my density array. The serial version and an OpenMP run with just 1 thread have identical results.
Increasing the number of threads leads to results at the same points, but those results are different from the serial version.
I know there is #pragma omp for ordered, but that makes the calculation too slow.
Is there a way to parallelize this loop while still getting my results ordered according to my points (x,y,z)?
Or maybe clearer: Why does increasing the thread number change my result?
double phipoint, Rpoint, zpoint;
double phiplane;
double distphi = 2.0 * M_PI / nPlanes; // set desired distance to phi to assign point to fluxtube plane
double* densityarr = new double[max_x_steps * max_y_steps * max_z_steps];

for (z = 0; z < max_z_steps; z++) {
    for (x = 0; x < max_x_steps; x++) {
        for (y = 0; y < max_y_steps; y++) {
            double x_center = x * stepSizeGrid - max_x / 2;
            double y_center = y * stepSizeGrid - max_y / 2;
            double z_center = z * stepSizeGrid - max_z / 2;
            cartesianCoordinate* pos = new cartesianCoordinate(x_center, y_center, z_center);
            linearToroidalCoordinate* tor = linearToroidal(*pos);
            simpleToroidalCoordinate* stc = simpleToroidal(*pos);
            phipoint = tor->phi;
            if (stc->r <= 0.174/*0.175*/) { // check if point is in vessel
                for (int phiInd = 0; phiInd < nPlanes; ++phiInd) {
                    phiplane = phis[phiInd];
                    if (std::abs(phipoint - phiplane) <= distphi) { // find right plane for point
                        Rpoint = tor->R;
                        zpoint = tor->z;
                        densityarr[z * max_y_steps * max_x_steps + x * max_y_steps + y] =
                            TubePlanes[phiInd].getMinDistDensity(Rpoint, zpoint);
                    }
                }
            }
            delete pos;
            delete tor;
            delete stc;
        }
    }
}
First, you need to address the errors in your parallel versions. You have race conditions writing to the shared variables phipoint (when the outer loops are parallel) and phiplane, Rpoint, zpoint (when any loop is parallel). You must declare those private, or better yet, declare them locally in the first place (which makes them implicitly private). That way the code is much easier to reason about - which is very important for parallel code.
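A minimal, self-contained illustration of that principle (this is not the questioner's physics code; all names are made up): because the temporary t is declared inside the loop body, every thread gets its own copy, and each out[i] is written by exactly one thread, so the result is the same for any thread count.
#include <cmath>
#include <vector>

int main()
{
    const int n = 1000;
    std::vector<double> out(n);

    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i) {
        double t = std::sin(0.001 * i);   // declared locally => implicitly private
        out[i] = t * t;                   // each i is written by exactly one thread
    }
    return 0;
}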
Parallelizing the outer loops like you describe is the obvious and very likely most efficient approach. If there are severe load imbalances (stc->r <= 0.174 not being evenly distributed among the points), you might want to use schedule(dynamic).
Parallelizing the inner loop seems unnecessary in your case. Generally, outer loops provide better efficiency because of lower overhead - unless they don't expose enough parallel work, have race conditions, have dependencies, or have cache issues. It would however be a worthwhile exercise to try it and measure. Note that there may also be a race condition when writing to densityarr if more than one of the phis satisfies the condition. Overall that loop seems a bit odd - since you only keep at most one of the results in densityarr, you could instead reverse the loop and cancel once you have found the first match (see the sketch below). That helps serial execution a lot, but may inhibit parallelization. Also, if you don't find a phi that satisfies the condition - or if the point is not in the vessel - then the respective entry in densityarr remains uninitialized; that can be very dangerous because you cannot later determine whether the value is valid.
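As a sketch of that reversal, using the identifiers from the question (and assuming getMinDistDensity has no side effects): breaking at the first match seen from the top reproduces what the forward loop would have left in densityarr.
for (int phiInd = nPlanes - 1; phiInd >= 0; --phiInd) {
    if (std::abs(phipoint - phis[phiInd]) <= distphi) {   // first match from the top
        densityarr[z * max_y_steps * max_x_steps + x * max_y_steps + y] =
            TubePlanes[phiInd].getMinDistDensity(tor->R, tor->z);
        break;   // at most one write per point
    }
}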
A general remark: don't allocate objects with new unless you need to. Just put pos on the stack; that will likely give you better performance. It can be a performance issue to allocate memory within a (parallel) loop, so you might want to rethink the way you get your Toroidals.
Note that I do assume that TubePlanes[phiInd].getMinDistDensity has no side effects, otherwise parallelization would be problematic.

Optimization of integral image

I'm trying to implement a multichannel integral image algorithm, but it's too slow (8 seconds for 200 images (640x480) on a Core 2 Quad). I expect it to take 1 second for 200 images.
This is the profiling result (over 200 images, n_bins=8):
How can I optimize *ps = *psu + s?
Start by checking your compiler settings: are they set for maximum performance?
Then, depending on the architecture, the calculation of an integral image has several possible bottlenecks.
The computation itself: some low-cost CPUs can't perform integer math with good performance. There is no real solution to this.
The data flow is not optimal. The solution is to provide optimal data flows (number of sequential read and write streams). For example, you can process 2 rows simultaneously.
The data dependency of the algorithm. On a modern CPU this can be the biggest problem. The solution is to change the processing algorithm, for example by calculating odd/even pixels without a dependency (more calculations, less dependency).
Processing can also be done on a GPU.
I have trouble believing that profile result. In this code
16 for (int x = 1; x < w + 1; x++, pg++, ps += n_bins, psu += n_bins) {
17 s += *pg;
18 *ps = *psu + s;
19 }
it says the lion's share of time is on line 18, very little on 17, and next to nothing on line 16.
Yet it is also doing a comparison, two increments, and three adds on every iteration.
Cache-misses might explain it, but there's no harm in double-checking, which I do with this technique.
Regardless, the loop could be unrolled, for example:
int x = w;
while (x >= 4) {
    s += pg[0];
    ps[n_bins*0] = psu[n_bins*0] + s;
    s += pg[1];
    ps[n_bins*1] = psu[n_bins*1] + s;
    s += pg[2];
    ps[n_bins*2] = psu[n_bins*2] + s;
    s += pg[3];
    ps[n_bins*3] = psu[n_bins*3] + s;
    x -= 4;
    pg += 4;
    ps += n_bins*4;
    psu += n_bins*4;
}
for (; --x >= 0;) {
    s += *pg;
    *ps = *psu + s;
    pg++;
    ps += n_bins;
    psu += n_bins;
}
If n_bins happens to be a constant, this could enable the compiler to do some more optimizing of the code in the while loop.
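As a hedged illustration of that idea (integral_row, the pointer types and the calling convention are assumptions, not the questioner's actual declarations), a template parameter turns the stride into a compile-time constant:
template <int NBins>
void integral_row(const unsigned char *pg, const unsigned int *psu,
                  unsigned int *ps, int w)
{
    unsigned int s = 0;
    for (int x = 1; x < w + 1; x++, pg++, ps += NBins, psu += NBins) {
        s += *pg;
        *ps = *psu + s;   // the stride NBins is now a compile-time constant
    }
}
// called as e.g. integral_row<8>(pg, psu, ps, w);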
You probably don't compute integral images just for the sake of computing integral images.
I imagine two situations:
1) you use the integral images on every pixel to compute a box filter or similar.
2) you use them at a much smaller number of places.
In case 1), the computation of the integral images will probably not be the bottleneck in your application.
In case 2), you should ask whether it is worth computing the entire integral images at all.
This said, parallelization with four threads is also an option. The easiest is to let every thread compute every fourth image.
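For example, a minimal sketch of the per-image scheme, assuming a routine integral_image() like the one being profiled exists (images, results and integral_image are placeholder names):
#pragma omp parallel for num_threads(4)
for (int i = 0; i < n_images; ++i)
    integral_image(images[i], results[i]);   // each image is independent, so no synchronisation is needed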
You can also split every image in four, but then you will be penalized both by the need to synchronize the threads and by the fact that prefix sums are constrained by a data dependency. (You can split the image in four and compute separate integral images, but after this step you will need to add a constant to three of the quarter images.)

Running over an unrolled linked list takes around 40% of the code runtime - are there any obvious ways to optimise it?

I am the author of an open source scientific code called vampire (http://github.com/richard-evans/vampire), and being compute-intensive means any improvement in code performance can significantly increase the amount of research that can be done. Typical runtimes of this code can be hundreds of core hours, so I am always looking for ways to improve the performance-critical sections of the code. However, I have got a bit stuck with the following, relatively innocuous-looking bit of code, which makes up around 40% of the runtime:
for (int atom = start_index; atom < end_index; atom++){
    register double Hx = 0.0;
    register double Hy = 0.0;
    register double Hz = 0.0;
    const int start = atoms::neighbour_list_start_index[atom];
    const int end = atoms::neighbour_list_end_index[atom] + 1;
    for (int nn = start; nn < end; nn++){
        const int natom = atoms::neighbour_list_array[nn];
        const double Jij = atoms::i_exchange_list[atoms::neighbour_interaction_type_array[nn]].Jij;
        Hx -= Jij * atoms::x_spin_array[natom];
        Hy -= Jij * atoms::y_spin_array[natom];
        Hz -= Jij * atoms::z_spin_array[natom];
    }
    atoms::x_total_spin_field_array[atom] += Hx;
    atoms::y_total_spin_field_array[atom] += Hy;
    atoms::z_total_spin_field_array[atom] += Hz;
}
The high level overview of the function and variables of this code is as follows: There is a 1D array of a physical vector (split into three 1D arrays, one per component x, y, z, for memory caching purposes: atoms::x_spin_array, etc.) called 'spin'. Each of these spins interacts with some other spins, and all the interactions are stored as a 1D neighbour list (atoms::neighbour_list_array). The relevant range of interactions for each atom is given by start and end indices into the neighbour list array, held in two separate arrays. At the end of the calculation each atomic spin has an effective field which is the vector sum of its interactions.
Given the small amount of code and the sizable fraction of the runtime it occupies, I have done my best, but I feel there must be a way to optimize this further; as a physicist rather than a computer scientist, maybe I am missing something?
You've got a constant stream of multiplies, subtracts and adds on contiguous data. That seems like an ideal use of SSE. If it's memory bandwidth limited, then OpenCL/CUDA instead.
Try using this library if you aren't familiar with all the low level instructions.
That inner loop could potentially then be restructured significantly maybe leading to speed ups.
If the x, y, z components are indeed linked lists, doing x[i], y[i] and z[i] will cause the lists to be traversed multiple times, giving (n^2)/2 iterations. Using vectors will make this an O(1) operation.
You mention that the three coordinates are split out for memory caching purposes, but this will affect the Level 1 and Level 2 cache locality as you are accessing 3 different areas in memory. The linked list is also impacting your cache locality.
Using something like:
struct vector3d {
    double x;
    double y;
    double z;
};

std::vector<vector3d> spin;
std::vector<vector3d> total_spin;
This should improve the cache locality, as the x, y and z values are adjacent in memory and the spins occupy a linear block of memory.
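With that layout, the inner loop from the question might look roughly like this (a sketch only; the other atoms:: arrays are kept as they are):
for (int nn = start; nn < end; nn++) {
    const int natom = atoms::neighbour_list_array[nn];
    const double Jij = atoms::i_exchange_list[atoms::neighbour_interaction_type_array[nn]].Jij;
    const vector3d &s = spin[natom];   // x, y and z now sit together in memory
    Hx -= Jij * s.x;
    Hy -= Jij * s.y;
    Hz -= Jij * s.z;
}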
I feel the following suggestions can help you optimize the code a bit, if not completely:
Use initialization over assignment wherever possible.
Prefer pre-increment over post-increment for better speed (believe me, it does make a difference).
Apart from that I think the code is just fine. There are pros and cons to each data structure; you have got to live with them.
Happy Coding!

Where is the bottleneck in this code?

I have the following tight loop that makes up the serial bottleneck of my code. Ideally I would parallelize the function that calls this, but that is not possible.
// n is about 60
for (int k = 0; k < n; k++)
{
    double fone = z[k*n+i+1];
    double fzer = z[k*n+i];
    z[k*n+i+1] = s*fzer + c*fone;
    z[k*n+i]   = c*fzer - s*fone;
}
Are there any optimizations that can be made such as vectorization or some evil inline that can help this code?
I am looking into finding eigen solutions of tridiagonal matrices. http://www.cimat.mx/~posada/OptDoglegGraph/DocLogisticDogleg/projects/adjustedrecipes/tqli.cpp.html
Short answer: Change the memory layout of your matrix from row-major order to column-major order.
Long answer:
It seems you are accessing the (i)th and (i+1)th columns of a matrix stored in row-major order - probably a big matrix that doesn't fit into the CPU cache as a whole. Basically, on every loop iteration the CPU has to wait for RAM (on the order of a hundred cycles). After a few iterations, theoretically, the address prediction should kick in and the CPU should speculatively load the data items even before the loop accesses them. That should help with RAM latency. But that still leaves the problem that the code uses the memory bus inefficiently: CPU and memory never exchange single bytes, only cache lines (64 bytes on current processors). Of every 64-byte cache line loaded and stored, your code only touches 16 bytes (a quarter).
Transposing the matrix and accessing it in native major order would increase memory bus utilization four-fold. Since that is probably the bottle-neck of your code, you can expect a speedup of about the same order.
Whether it is worth it, depends on the rest of your algorithm. Other parts may of course suffer because of the changed memory layout.
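A sketch of what the loop looks like once the matrix is stored transposed (zT here is the transposed matrix, introduced purely for illustration); both rows are now walked sequentially through memory:
double *row0 = &zT[i * n];        // formerly column i
double *row1 = &zT[(i + 1) * n];  // formerly column i+1
for (int k = 0; k < n; k++) {
    double fone = row1[k];
    double fzer = row0[k];
    row1[k] = s * fzer + c * fone;
    row0[k] = c * fzer - s * fone;
}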
I take it you are rotating something (or rather, lots of things, by the same angle (s being a sin, c being a cos))?
Counting backwards is always good fun and cuts out variable comparison for each iteration, and should work here. Making the counter the index might save a bit of time also (cuts out a bit of arithmetic, as said by others).
for (int k = (n-1)*n + i; k >= 0; k -= n)
{
    double fone = z[k+1];
    double fzer = z[k];
    z[k+1] = s*fzer + c*fone;
    z[k]   = c*fzer - s*fone;
}
Nothing dramatic here, but it looks tidier if nothing else.
As a first move, I'd cache pointers in this loop:
// n is about 60
double *cur_z = &z[0*n + i];
for (int k = 0; k < n; k++)
{
    double fone = *(cur_z + 1);
    double fzer = *cur_z;
    *(cur_z + 1) = s*fzer + c*fone;
    *cur_z       = c*fzer - s*fone;
    cur_z += n;
}
Second, I think it's better to make a templatized version of this function. As a result, you can get a good performance benefit if your matrix holds integer values (since FPU operations are slower).
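For example, a rough sketch of such a templated version (rotate_columns is an illustrative name, not from the original code); it mirrors the cached-pointer loop above:
template <typename T>
void rotate_columns(T *z, int n, int i, T s, T c)
{
    T *cur_z = &z[i];                     // column i of a row-major n x n matrix
    for (int k = 0; k < n; k++, cur_z += n) {
        T fone = cur_z[1];
        T fzer = cur_z[0];
        cur_z[1] = s * fzer + c * fone;
        cur_z[0] = c * fzer - s * fone;
    }
}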

C++ - What would be faster: multiplying or adding?

I have some code that is going to be run thousands of times, and was wondering what was faster.
array is a 30-value short array whose elements are always 0, 1 or 2.
result = (array[29] * 68630377364883.0)
+ (array[28] * 22876792454961.0)
+ (array[27] * 7625597484987.0)
+ (array[26] * 2541865828329.0)
+ (array[25] * 847288609443.0)
+ (array[24] * 282429536481.0)
+ (array[23] * 94143178827.0)
+ (array[22] * 31381059609.0)
+ (array[21] * 10460353203.0)
+ (array[20] * 3486784401.0)
+ (array[19] * 1162261467)
+ (array[18] * 387420489)
+ (array[17] * 129140163)
+ (array[16] * 43046721)
+ (array[15] * 14348907)
+ (array[14] * 4782969)
+ (array[13] * 1594323)
+ (array[12] * 531441)
+ (array[11] * 177147)
+ (array[10] * 59049)
+ (array[9] * 19683)
+ (array[8] * 6561)
+ (array[7] * 2187)
+ (array[6] * 729)
+ (array[5] * 243)
+ (array[4] * 81)
+ (array[3] * 27)
+ (array[2] * 9)
+ (array[1] * 3)
+ (b[0]);
Would it be faster if I use something like:
if (array[29] != 0)
{
    if (array[29] == 1)
    {
        result += 68630377364883.0;
    }
    else
    {
        result += (whatever 68630377364883.0 * 2 is);
    }
}
for each of them. Would this be faster/slower? If so, by how much?
That is a ridiculously premature "optimization". Chances are you'll be hurting performance because you are adding branches to the code. Mispredicted branches are very costly. And it also renders the code harder to read.
Multiplication in modern processors is a lot faster than it used to be; it can be done in a few clock cycles now.
Here's a suggestion to improve readability:
double result = 0.0;
for (int i = 1; i < 30; i++) {
    result += array[i] * pow(3, i);
}
result += b[0];
You can pre-compute an array with the values of pow(3, i) if you are really that worried about performance.
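For instance, a minimal sketch of that precomputed table (POW3 and base3_to_double are illustrative names, and array[0] is assumed to play the role of b[0]):
static double POW3[30];

void init_pow3()
{
    POW3[0] = 1.0;
    for (int i = 1; i < 30; ++i)
        POW3[i] = POW3[i - 1] * 3.0;   // fill once, e.g. at startup
}

double base3_to_double(const short *array)
{
    double result = 0.0;
    for (int i = 0; i < 30; ++i)
        result += array[i] * POW3[i];
    return result;
}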
First, on most architectures, mis-branching is very costly (depending on the execution pipeline depth), so I bet the non-branching version is better.
A variation on the code may be:
result = array[29];
for (int i = 28; i >= 0; i--)
    result = result * 3 + array[i];
Just make sure there are no overflows, so result must be in a type larger than 32-bit integer.
Even if addition is faster than multiplication, I think that you will lose more because of the branching. In any case, if addition is faster than multiplication, a better solution might be to use a table and index by it.
const double table[3] = {0.0, 68630377364883.0, 68630377364883.0 * 2.0};
result += table[array[29]];
My first attempt at optimisation would be to remove the floating-point ops in favour of integer arithmetic:
uint64_t total = b[0];
uint64_t x = 3;
for (int i = 1; i < 30; ++i, x *= 3) {
    total += array[i] * x;
}
uint64_t is not standard C++, but is very widely available. You just need a version of C99's stdint for your platform.
There's also optimising for comprehensibility and maintainability - was this code a loop at one point, and did you measure the performance difference when you replaced the loop? Fully unrolling like this might even make the program slower (as well as less readable), since the code is larger and hence occupies more of the instruction cache, and hence results in cache misses elsewhere. You just don't know.
This assumes, of course, that your constants actually are the powers of 3 - I haven't bothered checking, which is precisely what I consider to be the readability issue with your code...
This is basically doing what strtoull does. If you don't have the digits handy as an ASCII string to feed to strtoull then I guess you have to write your own implementation. As people point out, branching is what causes a performance hit, so your function is probably best written this way:
#include <tr1/cstdint>
uint64_t base3_digits_to_num(uint8_t digits[30])
{
    uint64_t running_sum = 0;
    uint64_t pow3 = 1;
    for (int i = 0; i < 30; ++i) {
        running_sum += digits[i] * pow3;
        pow3 *= 3;
    }
    return running_sum;
}
It's not clear to me that precomputing your powers of 3 is going to result in a significant speed advantage. You might try it and test yourself. The one advantage a lookup table might give you is that a smart compiler could possibly unroll the loop into a SIMD instruction. But a really smart compiler should then be able to do that anyway and generate the lookup table for you.
Avoiding floating point is also not necessarily a speed win. Floating point and integer operations are about the same on most processors produced in the last 5 years.
Checking to see if digits[i] is 0, 1 or 2 and executing different code for each of these cases is definitely a speed lose on any processor produced in the last 10 years. The Pentium3/Pentium4/Athlon Thunderbird days are when branches started to really become a huge hit, and the Pentium3 is at least 10 years old now.
Lastly, you might think this will be the bottleneck in your code. You're probably wrong. The right implementation is the one that is the simplest and most clear to anybody coming along reading your code. Then, if you want the best performance, run your code through a profiler and find out where to concentrate your optimization efforts. Agonizing this much over a little function when you don't even know that it's a bottleneck is silly.
And almost nobody here recognized that you were basically doing a base 3 conversion. So even your current primitive hand loop unrolling obscured your code enough that most people didn't understand it.
Edit: In fact, I looked at the assembly output. On an x86_64 platform the lookup table buys you nothing and may in fact be counter-productive because of its effect on the cache. The compiler generates leaq (%rdx,%rdx,2), %rdx in order to multiply by 3. Fetching from a table would be something like movq (%rdx,%rcx,8), %rax, which is basically the same speed aside from requiring a fetch from memory (which might be very expensive). So it's almost certain that my code with the gcc option -funroll-loops is significantly faster than your attempt to optimize by hand.
The lesson here is that the compiler does a much, much better job of optimization than you can. Just make your code as clear and readable to others as possible and let the compiler do the work. And making it clear to others has the additional advantage of making it easier for the compiler to do its job.
If you're not sure - why don't you just measure it yourself?
Second example will be most likely much slower, but not because of the addition - mispredicted conditional jumps cost a lot of time.
If you have only 3 possible values, the cheapest way might be to have a static 2D array of precomputed values, e.g. static const double vals[][3] = { {0, 1*3, 2*3}, {0, 1*9, 2*9}, ... }, and just sum vals[0][array[1]] + vals[1][array[2]] + ...
Some SIMD instructions might be faster than anything you can write on your own - look at those. Then again - if you're doing this a lot, handing it off to GPU might be even faster - depending on your other calculations.
Multiply, because branching is awfully slow.