C++: Improving cache performance in a 3D array

I don't know how to optimize cache performance at a really low level, thinking about cache-line size or associativity. That's not something you can learn overnight. Considering my program will run on many different systems and architectures, I don't think it would be worth it anyway. But still, there are probably some steps I can take to reduce cache misses in general.
Here is a description of my problem:
I have a 3d array of integers, representing values at points in space, like [x][y][z]. Each dimension is the same size, so it's like a cube. From that I need to make another 3d array, where each value in this new array is a function of 7 parameters: the corresponding value in the original 3d array, plus the values at the 6 indices that "touch" it in space. I'm not worried about the edges and corners of the cube for now.
Here is what I mean in C++ code:
void process3DArray(int input[LENGTH][LENGTH][LENGTH],
                    int output[LENGTH][LENGTH][LENGTH])
{
    for (int i = 1; i < LENGTH-1; i++)
        for (int j = 1; j < LENGTH-1; j++)
            for (int k = 1; k < LENGTH-1; k++)
            //The for loops start at 1 and stop before LENGTH-1,
            //or otherwise I'll get out-of-bounds errors.
            //I'm not concerned with the edges and corners of the
            //3d array "cube" at the moment.
            {
                int value = input[i][j][k];
                //I am expecting crazy cache misses here:
                int posX = input[i+1][j][k];
                int negX = input[i-1][j][k];
                int posY = input[i][j+1][k];
                int negY = input[i][j-1][k];
                int posZ = input[i][j][k+1];
                int negZ = input[i][j][k-1];
                output[i][j][k] =
                    process(value, posX, negX, posY, negY, posZ, negZ);
            }
}
However, it seems like if LENGTH is large enough, I'll get tons of cache misses when I'm fetching the parameters for process. Is there a cache-friendlier way to do this, or a better way to represent my data than a 3d array?
And if you have the time to answer these extra questions: do I have to consider the value of LENGTH? For example, is it different whether LENGTH is 20 vs. 100 vs. 10000? Also, would I have to do something else if I used something other than integers, say a 64-byte struct?
# ildjarn:
Sorry, I did not think that the code that generates the arrays I am passing into process3DArray mattered. But if it does, I would like to know why.
int main() {
    int data[LENGTH][LENGTH][LENGTH];
    for (int i = 0; i < LENGTH; i++)
        for (int j = 0; j < LENGTH; j++)
            for (int k = 0; k < LENGTH; k++)
                data[i][j][k] = rand() * (i + j + k);

    int result[LENGTH][LENGTH][LENGTH];
    process3DArray(data, result);
}

There's an answer to a similar question here: https://stackoverflow.com/a/7735362/6210 (by me!)
The main goal of optimizing a multi-dimensional array traversal is to make sure you visit the array such that you tend to reuse the cache lines accessed from the previous iteration step. For visiting each element of an array once and only once, you can do this just by visiting in memory order (as you are doing in your loop).
Since you are doing something more complicated than a simple element traversal (visiting an element plus 6 neighbors), you need to break up your traversal such that you don't access too many cache lines at once. Since the cache thrashing is dominated by traversing along j and k, you just need to modify the traversal such that you visit blocks at a time rather than rows at a time.
E.g.:
const int CACHE_LINE_STEP = 8;

void process3DArray(int input[LENGTH][LENGTH][LENGTH],
                    int output[LENGTH][LENGTH][LENGTH])
{
    for (int i = 1; i < LENGTH-1; i++)
        for (int k_start = 1, k_next = CACHE_LINE_STEP;
             k_start < LENGTH-1;
             k_start = k_next, k_next += CACHE_LINE_STEP)
        {
            int k_end = min(k_next, LENGTH - 1);
            for (int j = 1; j < LENGTH-1; j++)
            {
                for (int k = k_start; k < k_end; ++k)
                {
                    int value = input[i][j][k];
                    int posX = input[i+1][j][k];
                    int negX = input[i-1][j][k];
                    int posY = input[i][j+1][k];
                    int negY = input[i][j-1][k];
                    int posZ = input[i][j][k+1];
                    int negZ = input[i][j][k-1];
                    output[i][j][k] =
                        process(value, posX, negX, posY, negY, posZ, negZ);
                }
            }
        }
}
What this does is ensure that you don't thrash the cache, by visiting the grid in a block-oriented fashion (actually, more like a fat-column-oriented fashion bounded by the cache-line size). It's not perfect, as there are overlaps that cross cache lines between columns, but you can tweak it to make it better.

The most important thing you already have right. If you were using Fortran, you'd be doing it exactly wrong, but that's another story. What you have right is that you are processing the inner loop along the direction where memory addresses are closest together. A single memory fetch (beyond the cache) will pull in multiple values, corresponding to a series of adjacent values of k. Inside your loop the cache will contain some number of values from (i,j), a similar number from (i±1,j), and from (i,j±1). So you basically have five disjoint sections of memory active. For small values of LENGTH these will only be one or three sections of memory. It is in the nature of how caches are built that you can have more than this many disjoint sections of memory in your active set.
I hope process() is small, and inline. Otherwise this may well be insignificant. Also, it will affect whether your code fits in the instruction cache.
Since you're interested in performance, it is almost always better to initialize five pointers (you only need one for value, posZ and negZ), and then take *(p++) inside the loop.
input[i+1][j][k];
is asking the compiler to generate three adds and two multiplies, unless you have a very good optimizer. If your compiler is particularly lazy about register allocation, you also get four memory accesses; otherwise one.
*inputIplusOneJK++
is asking for one add and a memory reference.
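As a rough sketch of that pointer-walking idea (the names here are made up, process() is a stand-in for the question's function, and LENGTH is fixed small purely for illustration):

```cpp
#include <cstddef>

const int LENGTH = 16;

// Hypothetical stand-in for the question's process() function.
static int process(int v, int px, int nx, int py, int ny, int pz, int nz) {
    return v + px + nx + py + ny + pz + nz;
}

// Pointer-walking variant: five walking pointers replace the repeated index
// arithmetic; the z neighbors reuse the center pointer with +1/-1 offsets.
void process3DArrayPtr(int input[LENGTH][LENGTH][LENGTH],
                       int output[LENGTH][LENGTH][LENGTH])
{
    for (int i = 1; i < LENGTH - 1; i++)
        for (int j = 1; j < LENGTH - 1; j++)
        {
            const int* center = &input[i][j][1];
            const int* px     = &input[i + 1][j][1];
            const int* nx     = &input[i - 1][j][1];
            const int* py     = &input[i][j + 1][1];
            const int* ny     = &input[i][j - 1][1];
            int*       out    = &output[i][j][1];
            for (int k = 1; k < LENGTH - 1; k++, ++center)
                *out++ = process(*center, *px++, *nx++, *py++, *ny++,
                                 center[1], center[-1]);  // k+1 and k-1 neighbors
        }
}
```

Whether this beats the plain indexed form depends on the optimizer; modern compilers often perform this strength reduction themselves, so measure before committing to it.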

Related

Why is bool [][] more efficient than vector<vector<bool>>

I am trying to solve a DP problem where I create a 2D array and fill it all the way. My function is called multiple times with different test cases. When I use a vector<vector<bool>>, I get a time-limit-exceeded error (takes more than 2 sec for all the test cases). However, when I use bool [][], it takes much less time (about 0.33 sec), and I get a pass.
Can someone please help me understand why vector<vector<bool>> would be any less efficient than bool [][].
bool findSubsetSum(const vector<uint32_t> &input)
{
    uint32_t sum = 0;
    for (uint32_t i : input)
        sum += i;
    if ((sum % 2) == 1)
        return false;
    sum /= 2;
#if 1
    bool subsum[input.size()+1][sum+1];
    uint32_t m = input.size()+1;
    uint32_t n = sum+1;
    for (uint32_t i = 0; i < m; ++i)
        subsum[i][0] = true;
    for (uint32_t j = 1; j < n; ++j)
        subsum[0][j] = false;
    for (uint32_t i = 1; i < m; ++i) {
        for (uint32_t j = 1; j < n; ++j) {
            if (j < input[i-1])
                subsum[i][j] = subsum[i-1][j];
            else
                subsum[i][j] = subsum[i-1][j] || subsum[i-1][j - input[i-1]];
        }
    }
    return subsum[m-1][n-1];
#else
    vector<vector<bool>> subsum(input.size()+1, vector<bool>(sum+1));
    for (uint32_t i = 0; i < subsum.size(); ++i)
        subsum[i][0] = true;
    for (uint32_t j = 1; j < subsum[0].size(); ++j)
        subsum[0][j] = false;
    for (uint32_t i = 1; i < subsum.size(); ++i) {
        for (uint32_t j = 1; j < subsum[0].size(); ++j) {
            if (j < input[i-1])
                subsum[i][j] = subsum[i-1][j];
            else
                subsum[i][j] = subsum[i-1][j] || subsum[i-1][j - input[i-1]];
        }
    }
    return subsum.back().back();
#endif
}
Thank you,
Ahmed.
If you need a matrix and you need to do high-performance work, a nested std::vector or std::array is not always the best solution, because these are not contiguous in memory. Non-contiguous memory access results in more cache misses.
See more :
std::vector and contiguous memory of multidimensional arrays
Is the data in nested std::arrays guaranteed to be contiguous?
On the other hand, bool twoDAr[M][N] is guaranteed to be contiguous, which means fewer cache misses.
See more :
C / C++ MultiDimensional Array Internals
And to know about cache friendly codes:
What is “cache-friendly” code?
Can someone please help me understand why would vector<vector<bool>> be any less efficient than bool [][].
A two-dimensional bool array is really just one big one-dimensional bool array of size M * N, with no gaps between the items.
A two-dimensional std::vector doesn't exist as such; what you have is not one big one-dimensional std::vector but a std::vector of std::vectors. The outer vector itself has no memory gaps, but there may well be gaps between the content areas of the individual inner vectors. It depends on how your compiler implements the very special std::vector<bool> class, but if your element count is sufficiently big, then dynamic allocation is unavoidable to prevent a stack overflow, and that alone implies pointers to separated memory areas.
And once you need to access data from separated memory areas, things become slower.
Here is a possible solution:
Try to use a std::vector<bool> of size (input.size() + 1) * (sum + 1).
If that fails to make things faster, avoid the template specialisation by using a std::vector<char> of size (input.size() + 1) * (sum + 1), and cast the elements to and from bool as required.
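A sketch of that flattened-buffer suggestion, reusing the question's DP recurrence (the name findSubsetSumFlat and the indexing helper are illustrative, not from the original post):

```cpp
#include <cstdint>
#include <vector>

// Flattened-buffer variant of the question's subset-sum DP: one contiguous
// vector<char> of rows * cols cells, indexed manually, instead of a
// vector<vector<bool>> of separately allocated rows.
bool findSubsetSumFlat(const std::vector<uint32_t>& input)
{
    uint32_t sum = 0;
    for (uint32_t v : input)
        sum += v;
    if (sum % 2 == 1)
        return false;
    sum /= 2;

    const uint32_t rows = input.size() + 1;
    const uint32_t cols = sum + 1;
    std::vector<char> subsum(rows * cols, 0);    // contiguous, row-major
    auto at = [&](uint32_t i, uint32_t j) -> char& {
        return subsum[i * cols + j];             // manual 2D indexing
    };

    for (uint32_t i = 0; i < rows; ++i)
        at(i, 0) = 1;                            // empty subset sums to 0
    for (uint32_t i = 1; i < rows; ++i)
        for (uint32_t j = 1; j < cols; ++j)
            at(i, j) = (j < input[i - 1])
                ? at(i - 1, j)
                : (at(i - 1, j) || at(i - 1, j - input[i - 1]));
    return at(rows - 1, cols - 1) != 0;
}
```

The whole table is now one allocation, so row i and row i-1 sit next to each other in memory, which is exactly the access pattern the inner loop wants.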
In cases where you know the size of the array from the beginning, an array will always be at least as fast as a vector, because vector is a wrapper around an array, i.e. a higher-level implementation. Its benefit is allocating extra space for you when needed, which you don't need if you have a fixed number of elements.
If you had a problem that needed 1D arrays, the difference might not have bothered you (there you have a single vector and a single array). But when creating a 2D structure, you also create many instances of the vector class, so the time difference between array and vector is multiplied by the number of containers you have, making your code slow.
This time difference has many causes behind it, but the most obvious one is calling the vector constructor: you are calling a function subsum.size() times. The memory issue mentioned by other answers is another cause.
For performance, it is advised to use arrays whenever you can. Even if you need to use a vector, you should try to minimize the number of resizes the vector does (by reserving or pre-allocating), getting closer to an array's behavior.

writing slower than the operation itself?

I am struggling to understand the behavior of my functions.
My code is written in C++ in Visual Studio 2012, running on Windows 7 64-bit. I am working with 2D arrays of floats. When I time my function, I see that its time is reduced by 10x or more if I just stop writing my results to the output pointer. Does that mean that writing is slow?
Here is an example:
void TestSpeed(float** pInput, float** pOutput)
{
    UINT32 y, x, i, j;
    for (y = 3; y < 100-3; y++)
    {
        for (x = 3; x < 100-3; x++)
        {
            float fSum = 0;
            for (i = y-3; i <= y+3; i++)
            {
                for (j = x-3; j <= x+3; j++)
                {
                    fSum += pInput[y][x]*exp(-(pInput[y][x]-pInput[i][j])*(pInput[y][x]-pInput[i][j]));
                }
            }
            pOutput[y][x] = fSum;
        }
    }
}
If I comment out the line pOutput[y][x] = fSum; then the function runs very quickly. Why is that?
I am calling 2-3 such functions sequentially. Would it help to use the stack instead of the heap to write a chunk of results, pass it on to the next function, and then write back to the heap buffer after that chunk is ready?
In some cases I saw that replacing pOutput[y][x] with a line buffer allocated on the stack, like float fResult[100], and using it to store results works faster for larger data sizes.
Your code performs a lot of operations, and that takes time. Depending on what you are doing with the output, you may consider diagonalization or decomposition of your input matrix. Or you can look for values in your output which are n times another value, etc., and skip calculating the exponential for those.

How to parallelize a loop?

I'm using OpenMP in C++ and I want to parallelize a very simple loop. But I can't do it correctly; every time I get a wrong result.
for (i = 2; i < N; i++)
    for (j = 2; j < N; j++)
        A[i][j] = A[i-2][j] + A[i][j-2];
Code:
int const N = 10;
int arr[N][N];

#pragma omp parallel for
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        arr[i][j] = 1;

#pragma omp parallel for
for (int i = 2; i < N; i++)
    for (int j = 2; j < N; j++)
    {
        arr[i][j] = arr[i-2][j] + arr[i][j-2];
    }

for (int i = 0; i < N; i++)
{
    for (int j = 0; j < N; j++)
        printf_s("%d ", arr[i][j]);
    printf("\n");
}
Do you have any suggestions how I can do it? Thank you!
Serial and parallel runs will give different results, because in
#pragma omp parallel for
for (int i = 2; i < N; i++)
    for (int j = 2; j < N; j++)
    {
        arr[i][j] = arr[i-2][j] + arr[i][j-2];
    }
.....
you update arr[i][j], so you change data used by the other threads. This leads to a data race!
This
#pragma omp parallel for
for (int i = 2; i < N; i++)
    for (int j = 2; j < N; j++)
    {
        arr[i][j] = arr[i-2][j] + arr[i][j-2];
    }
is always going to be a source of grief and unpredictable output. The OpenMP run time is going to hand each thread a range of values for i and leave them to it. There will be no determinism in the relative order in which threads update arr. For example, while thread 1 is updating elements with i = 2,3,4,...,100 (or whatever) and thread 2 is updating elements with i = 102,103,...,200, the program does not determine whether thread 1 updates the row i = 100 of arr before or after thread 2 wants to use those updated values. You have written code with a classic data race.
You have a number of options to fix this:
You could tie yourself in knots trying to ensure that the threads update arr in the right (ie sequential) order. The end result would be an OpenMP program that runs more slowly than the sequential program. DO NOT TAKE THIS OPTION.
You could make 2 copies of arr and always update from one to the other, then from the other to the one. Something like (very pseudo-code)
for ...
{
    old = 0
    new = 1
    arr[i][j][new] = arr[i-2][j][old] + arr[i][j-2][old];
    old = 1
    new = 0
}
Of course, this second approach trades space for time but that's often a reasonable trade-off.
You may find that adding an extra plane to arr doesn't immediately speed things up because it wrecks the spatial locality of values pulled into cache. Experiment a bit with this, possibly make [old] the first index element rather than the last.
Since updating each element in the array depends on the values found in elements 2 rows/columns away you're effectively splitting the array up like a chess-board, into white and black elements. You could use 2 threads, one on each 'colour', without the threads racing for access to the same data. Again, though, the disruption of spatial locality in the cache might have a bad impact on speed.
If any other options occur to me I'll edit them in.
To parallelize the loop nest in the question is tricky, but doable. Lamport's paper "The Parallel Execution of DO Loops" covers the technique. Basically you have to rotate your (i,j) coordinates by 45 degrees into a new coordinate system (k,l), where k=i+j and l=i-j.
Though to actually get speedup, the iterations likely have to be grouped into tiles, which makes the code even uglier (four nested loops).
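Not from the original answer, but a minimal sketch of that rotated (anti-diagonal) ordering without tiling may help. Every cell on a diagonal d = i + j depends only on cells two diagonals back, so the loop over a single diagonal is safe to parallelize (the pragma degrades to a no-op when OpenMP is disabled):

```cpp
#include <vector>

// Wavefront sketch for A[i][j] = A[i-2][j] + A[i][j-2]: iterate anti-diagonals
// d = i + j.  All (i, j) on one diagonal are mutually independent, since both
// source cells lie on diagonal d - 2, which is already complete.
void wavefront(std::vector<std::vector<int>>& A)
{
    const int N = (int)A.size();
    for (int d = 4; d <= 2 * (N - 1); ++d)   // smallest valid diagonal: i = j = 2
    {
        #pragma omp parallel for
        for (int i = 2; i < N; ++i)
        {
            int j = d - i;
            if (j >= 2 && j < N)
                A[i][j] = A[i - 2][j] + A[i][j - 2];
        }
    }
}
```

Each cell is visited exactly once (its diagonal index i + j is unique), so the result matches the serial loop; only the order of independent updates changes.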
A completely different approach is to solve the problem recursively, using OpenMP tasking. The recursion is:
if (too small to be worth parallelizing) {
    do serially
} else {
    // Recursively:
    Do upper left quadrant
    Do lower left and upper right quadrants in parallel
    Do lower right quadrant
}
As a practical matter, the ratio of arithmetic operations to memory accesses is so low that it is going to be difficult to get speedup out of the example.
If you are asking about parallelism in general, then one more possible answer is vectorization. You could achieve some relatively modest vector parallelism (something like a 2x speedup or so) without changing the data structure or codebase. This is possible using OpenMP 4.0's or Cilk Plus's pragma simd or similar (with safelen/vectorlength(2)).
Well, you really do have a data dependence (in both the inner and outer loops), but it belongs to the "WAR" (write-after-read) sub-category, which is a blocker for using "omp parallel for" as-is, but not necessarily a problem for "pragma omp simd" loops.
To make this work you will need an x86 compiler supporting pragma simd, either via OpenMP 4 or via Cilk Plus (a very recent gcc, or the Intel compiler).

Matrix Multiplication optimization via matrix transpose

I am working on an assignment where I transpose a matrix to reduce cache misses for a matrix multiplication operation. From what I understand from a few classmates, I should get 8x improvement. However, I am only getting 2x ... what might I be doing wrong?
Full Source on GitHub
void transpose(int size, matrix m) {
    int i, j;
    for (i = 0; i < size; i++)
        for (j = 0; j < size; j++)
            std::swap(m.element[i][j], m.element[j][i]);
}

void mm(matrix a, matrix b, matrix result) {
    int i, j, k;
    int size = a.size;
    long long before, after;
    before = wall_clock_time();
    // Do the multiplication
    transpose(size, b); // transpose the matrix to reduce cache misses
    for (i = 0; i < size; i++)
        for (j = 0; j < size; j++) {
            int tmp = 0; // save memory writes
            for (k = 0; k < size; k++)
                tmp += a.element[i][k] * b.element[j][k];
            result.element[i][j] = tmp;
        }
    after = wall_clock_time();
    fprintf(stderr, "Matrix multiplication took %1.2f seconds\n",
            ((float)(after - before)) / 1000000000);
}
Am I doing things right so far?
FYI: The next optimization I need to do is use SIMD/Intel SSE3
Am I doing things right so far?
No. You have a problem with your transpose. You should have seen this problem before you started worrying about performance. When you are doing any kind of hacking around for optimizations, it is always a good idea to use the naive but suboptimal implementation as a test. An optimization that achieves a factor of 100 speedup is worthless if it doesn't yield the right answer.
Another optimization that will help is to pass by reference. You are passing copies. In fact, your matrix result may never get out because you are passing copies. Once again, you should have tested.
Yet another optimization that will help the speedup is to cache some pointers. This is still quite slow:
for (k = 0; k < size; k++)
    tmp += a.element[i][k] * b.element[j][k];
result.element[i][j] = tmp;
An optimizer might see a way around the pointer problems, but probably not. At least not if you don't use the nonstandard __restrict__ keyword to tell the compiler that your matrices don't overlap. Cache pointers so you don't have to do a.element[i], b.element[j], and result.element[i]. And it still might help to tell the compiler that these arrays don't overlap with the __restrict__ keyword.
Addendum
After looking over the code, it needs help. A minor comment first. You aren't writing C++. Your code is C with a tiny hint of C++. You're using struct rather than class, malloc rather than new, typedef struct rather than just struct, C headers rather than C++ headers.
Because of your implementation of your struct matrix, my comment on slowness due to copy constructors was incorrect. That it was incorrect is even worse! Using the implicitly-defined copy constructor in conjunction with classes or structs that contain naked pointers is playing with fire. You will get burned very badly if someone calls m(a, a, a_squared) to get the square of matrix a. You will get burned even worse if someone expects m(a, a, a) to do an in-place computation of a squared.
Mathematically, your code only covers a tiny portion of the matrix multiplication problem. What if someone wants to multiply a 100x1000 matrix by a 1000x200 matrix? That's perfectly valid, but your code doesn't handle it because your code only works with square matrices. On the other hand, your code will let someone multiply a 100x100 matrix by a 200x200 matrix, which doesn't make a bit of sense.
Structurally, your code has close to a 100% guarantee that it will be slow because of your use of ragged arrays. malloc can spritz the rows of your matrices all across memory. You'll get much better performance if the matrix is internally represented as a contiguous array but is accessed as if it were a NxM matrix. C++ provides some nice mechanisms for doing just that.
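A sketch of that contiguous representation (the Matrix class name and interface here are illustrative, not part of the assignment): one flat std::vector holds all the elements, and operator() does the row-major index arithmetic.

```cpp
#include <cstddef>
#include <vector>

// Contiguous row-major matrix: one flat buffer accessed as rows x cols,
// instead of ragged malloc'ed rows scattered across memory.
class Matrix {
public:
    Matrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0) {}

    int&       operator()(std::size_t r, std::size_t c)       { return data_[r * cols_ + c]; }
    const int& operator()(std::size_t r, std::size_t c) const { return data_[r * cols_ + c]; }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_, cols_;
    std::vector<int> data_;   // rows_ * cols_ ints, contiguous
};

// Multiply an (n x k) by a (k x m) matrix; rectangular shapes are allowed,
// unlike the square-only code in the question.  Uses the i-k-j loop order,
// so the inner loop streams through rows of both result and b.
Matrix multiply(const Matrix& a, const Matrix& b)
{
    Matrix result(a.rows(), b.cols());        // zero-initialized
    for (std::size_t i = 0; i < a.rows(); ++i)
        for (std::size_t k = 0; k < a.cols(); ++k) {
            const int aik = a(i, k);          // hoisted for the inner loop
            for (std::size_t j = 0; j < b.cols(); ++j)
                result(i, j) += aik * b(k, j);
        }
    return result;
}
```

This also removes the need for transposition entirely, since no loop ever walks a column with a large stride.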
If your assignment implies that you MUST transpose, then, of course, you should correct your transpose procedure. As it stands, it does the transpose TWO times, resulting in no transpose at all. The j-loop should not read
j=0; j<size; j++
but
j=0; j<i; j++
Transposing is not necessary to avoid processing the elements of one of the factor matrices in the "wrong" order. Just interchange the j-loop and the k-loop. Leaving aside for the moment any (other) performance tuning, the basic loop structure (with result zero-initialized beforehand) should be:
for (int i = 0; i < size; i++)
{
    for (int k = 0; k < size; k++)
    {
        double tmp = a[i][k];
        for (int j = 0; j < size; j++)
        {
            result[i][j] += tmp * b[k][j];
        }
    }
}

C++ performance: checking a block of memory for having specific values in specific cells

I'm doing research on 2D bin packing algorithms. I've asked a similar question regarding PHP's performance (it was too slow to pack), and now the code is converted to C++.
It's still pretty slow. What my program does is consecutively allocate blocks of dynamic memory and populate them with the character 'o':
char* bin;
bin = new (nothrow) char[area];
if (bin == 0) {
    cout << "Error: " << area << " bytes could not be allocated";
    return false;
}
for (int i = 0; i < area; i++) {
    bin[i] = 'o';
}
(their size is between 1kb and 30kb for my datasets)
Then the program checks different combinations of 'x' characters inside of current memory block.
void place(char* bin, int* best, int width)
{
    for (int i = best[0]; i < best[0]+best[1]; i++)
        for (int j = best[2]; j < best[2]+best[3]; j++)
            bin[i*width+j] = 'x';
}
One of the functions that checks the non-overlapping gets called millions of times during a runtime.
bool fits(char* bin, int* pos, int width)
{
    for (int i = pos[0]; i < pos[0]+pos[1]; i++)
        for (int j = pos[2]; j < pos[2]+pos[3]; j++)
            if (bin[i*width+j] == 'x')
                return false;
    return true;
}
All the other stuff takes only a percent of the runtime, so I need to make these two guys (fits and place) faster. Who's the culprit?
Since I only have two values, 'x' and 'o', I could try to use just one bit instead of the whole byte a char takes. But I'm more concerned with speed; do you think that would make things faster?
Thanks!
Update: I replaced int* pos with rect pos (the same for best), as MSalters suggested. At first I saw improvement, but I tested more with bigger datasets and it seems to be back to normal runtimes. I'll try other techniques suggested and will keep you posted.
Update: using memset and memchr sped things up about twofold. Replacing 'x' and 'o' with '\1' and '\0' didn't show any improvement. __restrict wasn't helpful either. Overall, I'm satisfied with the performance of the program now, since I also made some improvements to the algorithm itself. I'm yet to try using a bitmap and compiling with -O2 (-O3)... Thanks again, everybody.
The best possibility would be to use an algorithm with better complexity.
But even your current algorithm could be sped up. Try using SSE instructions to test ~16 bytes at once; also, you can make a single large allocation and split it up yourself, which will be faster than using the library allocator (the library allocator has the advantage of letting you free blocks individually, but I don't think you need that feature).
[Of course: profile it!]
Using a bit rather than a byte will not be faster in the first instance.
However, consider that with characters, you can cast blocks of 4 or 8 bytes to unsigned 32-bit or 64-bit integers (making sure you handle alignment), and compare that to the value for 'oooo' or 'oooooooo' in the block. That allows a very fast compare.
Having gone down the integer approach, you can see that you could do the same with the bit approach and handle, say, 64 bits in a single compare. That should surely give a real speed-up.
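A hedged sketch of the byte-block compare (the helper name is made up; using memcmp against an all-'o' pattern sidesteps the alignment and strict-aliasing concerns of casting the buffer to uint64_t directly):

```cpp
#include <cstring>

// Test whether a row of the bin is entirely 'o', comparing 8 bytes at a time
// against an all-'o' pattern and falling back to a byte loop for the tail.
bool rowAllEmpty(const char* row, int len)
{
    static const char pattern[8] = {'o','o','o','o','o','o','o','o'};
    int i = 0;
    for (; i + 8 <= len; i += 8)
        if (std::memcmp(row + i, pattern, 8) != 0)
            return false;            // some byte in this 8-byte block is not 'o'
    for (; i < len; ++i)             // tail of fewer than 8 bytes
        if (row[i] != 'o')
            return false;
    return true;
}
```

With optimization enabled, compilers typically expand the fixed-size memcmp into a single 64-bit load and compare, so this gets close to the hand-cast version without its portability pitfalls.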
Bitmaps will increase the speed as well, since they involve touching less memory and thus will cause more memory references to come from the cache. Also, in place, you might want to copy the elements of best into local variables so that the compiler knows that your writes to bin will not change best. If your compiler supports some spelling of restrict, you might want to use that as well. You can also replace the inner loop in place with the memset library function, and the inner loop in fits with memchr; those may not be large performance improvements, though.
First of all, have you remembered to tell your compiler to optimize?
And to turn off slow array index bounds checking and such?
That done, you will get a substantial speed-up by representing your binary values as individual bits, since you can then set or clear, say, 32 or 64 bits at a time.
Also, I would tend to assume that the dynamic allocations would add a fair bit of overhead, but apparently you have measured and found that it isn't so. If, however, the memory management actually contributes significantly to the time, then the solution depends a bit on the usage pattern. But possibly your code generates stack-like alloc/free behavior, in which case you can optimize the allocations down to almost nothing: just allocate a big chunk of memory at the start and then sub-allocate stack-like from that.
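A minimal sketch of such stack-like sub-allocation (the Arena name and interface are made up for illustration): grab one big block up front, hand out pieces by bumping an offset, and "free" in LIFO order by rolling the offset back.

```cpp
#include <cstddef>
#include <vector>

// Bump allocator: one upfront block, sub-allocated by advancing an offset.
// Freeing is LIFO only: record a marker, then roll back to it.
class Arena {
public:
    explicit Arena(std::size_t bytes) : buffer_(bytes), offset_(0) {}

    char* allocate(std::size_t n) {
        if (offset_ + n > buffer_.size())
            return nullptr;                    // out of arena space
        char* p = buffer_.data() + offset_;
        offset_ += n;
        return p;
    }

    std::size_t marker() const        { return offset_; }
    void release(std::size_t marker)  { offset_ = marker; }  // LIFO "free"

private:
    std::vector<char> buffer_;
    std::size_t offset_;
};
```

Take marker() before a batch of allocations and release(marker) when the batch is done; individual frees are not supported, which is exactly the stack-like usage pattern described above.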
Considering your current code:
void place(char* bin, int* best, int width)
{
    for (int i = best[0]; i < best[0]+best[1]; i++)
        for (int j = best[2]; j < best[2]+best[3]; j++)
            bin[i*width+j] = 'x';
}
Due to possible aliasing the compiler may not realize that e.g. best[0] will be constant during the loop.
So, tell it:
void place(char* bin, int const* best, int const width)
{
    int const maxY = best[0] + best[1];
    int const maxX = best[2] + best[3];
    for (int y = best[0]; y < maxY; ++y)
    {
        for (int x = best[2]; x < maxX; ++x)
        {
            bin[y*width + x] = 'x';
        }
    }
}
Most probably your compiler will hoist the y*width computation out of the inner loop, but why not tell it to do that too:
void place(char* bin, int* best, int const width)
{
    int const maxY = best[0] + best[1];
    int const maxX = best[2] + best[3];
    for (int y = best[0]; y < maxY; ++y)
    {
        int const startOfRow = y*width;
        for (int x = best[2]; x < maxX; ++x)
        {
            bin[startOfRow + x] = 'x';
        }
    }
}
This manual optimization (also applied to the other routine) may or may not help; it depends on how smart your compiler is.
Next, if that doesn't help enough, consider replacing inner loop with std::fill (or memset), doing a whole row in one fell swoop.
And if that doesn't help or doesn't help enough, switch over to bit-level representation.
It is perhaps worth noting, and trying out, that every PC has built-in hardware support for optimizing bit-level operations, namely the graphics accelerator card (in old times called a blitter chip). So you might just use an image library and a black/white bitmap. But since your rectangles are small, I'm not sure whether the setup overhead will outweigh the speed of the actual operation; that needs to be measured. ;-)
Cheers & hth.,
The biggest improvement I'd expect is from a non-trivial change:
// changed pos to class rect for cleaner syntax
bool fits(char* bin, rect pos, int width)
{
    if (bin[pos.top()*width + pos.left()] == 'x')
        return false;
    if (bin[pos.bottom()*width + pos.right()] == 'x')
        return false;
    if (bin[pos.bottom()*width + pos.left()] == 'x')
        return false;
    if (bin[pos.top()*width + pos.right()] == 'x')
        return false;
    for (int i = pos.top(); i <= pos.bottom(); i++)
        for (int j = pos.left(); j <= pos.right(); j++)
            if (bin[i*width+j] == 'x')
                return false;
    return true;
}
Sure, you're testing bin[pos.bottom()*width + pos.right()] twice. But the first time you do so is much earlier in the algorithm. You add boxes, which means that there is a strong correlation between adjacent bins. Therefore, by checking the corners first, you often return much earlier. You could even consider adding a fifth check in the middle.
Beyond the obligatory statement about using a profiler: the advice above about replacing things with a bitmap is a very good idea. If that does not appeal to you, consider replacing
for (int i = 0; i < area; i++) {
    bin[i] = 'o';
}
by
memset(bin, 'o', area);
Typically the memset will be faster, since the library implementation uses wide, optimized stores.
Also
void place(char* bin, int* best, int width)
{
    for (int i = best[0]; i < best[0]+best[1]; i++)
        for (int j = best[2]; j < best[2]+best[3]; j++)
            bin[i*width+j] = 'x';
}
has a bit of room for improvement:
void place(char* bin, int* best, int width)
{
    for (int i = best[0]; i < best[0]+best[1]; i++)
        memset(bin + i*width + best[2], 'x', best[3]);
}
by eliminating one of the loops.
A last idea is to change your data representation.
Consider using the '\0' character as a replacement for your 'o', and '\1' as a replacement for your 'x'. This is sort of like using a bitmap.
This would enable you to test like this:
if (bin[i])
{
    // Is an 'x'
}
else
{
    // Is an 'o'
}
Which might produce faster code. Again, the profiler is your friend :)
This representation would also enable you to simply sum a range of cells to determine how many 'x's there are:
int sum = 0;
for (int i = 0; i < 12; i++)
{
    sum += bin[i];
}
cout << "There are " << sum << " 'x's in the range" << endl;
Best of luck to you
Evil.
If you have 2 values for your basic type, I would first try to use bool. Then the compiler knows you have 2 values and might be able to optimize some things better.
Apart from that, add const where possible (for example the parameter of fits(bool const*, ...)).
I'd think about memory-cache breaks. These functions run through sub-matrices inside a bigger matrix, which I suppose is many times bigger in both width and height.
That means the small matrix's lines are contiguous memory, but between lines it might break cache pages.
Consider representing the big matrix's cells in memory in an order that keeps sub-matrix elements as close to each other as possible, instead of keeping a vector of contiguous full lines. The first option that comes to my mind is to break your big matrix recursively into matrices of size [2^i, 2^i], ordered {top-left, top-right, bottom-left, bottom-right}.
1) I.e. if your matrix is of size [X,Y], represented in an array of size X*Y, then element [x,y] is at position(x,y) in the array. Use, instead of (y*X + x):
unsigned position(unsigned rx, unsigned ry)
{
    unsigned x = rx;
    unsigned y = ry;
    unsigned part = 1;
    unsigned pos = 0;
    while ((x != 0) || (y != 0)) {
        unsigned const lowest_bit_x = (x % 2);
        unsigned const lowest_bit_y = (y % 2);
        pos += (((2*lowest_bit_y) + lowest_bit_x) * part);
        x /= 2; // throw away lowest bit
        y /= 2;
        part *= 4; // each level covers 4x the area
    }
    return pos;
}
I didn't check this code; it's just to explain what I mean. If you need to, also try to find a faster way to implement it.
But note that the array you allocate will be bigger than X*Y; it has to be the smallest possible (2^(2*k)), and that would be wasteful unless X and Y are at about the same size scale. This can be solved by first breaking the big matrix into squares.
And then the cache benefits might outweigh the more complex position(x,y).
2) Then try to find the best way to run through the elements of a sub-matrix in fits() and place(). I'm not sure yet what it is; it's not necessarily the way you do it now. Basically, a sub-matrix of size [x,y] should break into no more than y*log(x)*log(y) blocks that are contiguous in the array representation, but they all fit inside no more than 4 blocks of size 4*x*y. So finally, for matrices that are smaller than a memory-cache page, you'll get no more than 4 cache breaks, while your original code could break y times.