Am I using the SSE resources most efficiently? - c++

I have this piece of code to find how many pixels of key lie in the low/high range.
The low and high matrices are generated from a big input matrix. I have to output the coordinates of low/high where the number of matching pixels is greater than 150 (of 256).
int8_t high[8192][8192];
int8_t low[8192][8192];
int8_t key[16][16];
for (int i = 0; i <= 8192 - 16; i++)
for (int j = 0; j <= 8192 - 16; j++)
{
    for (int ii = 0; ii < 16; ii++)
    {
        char *kLoc = (char*)key[ii];
        char *lLoc = (char*)low[i + ii] + j;
        char *hLoc = (char*)high[i + ii] + j;
        __m128i high, low, num;
        low = _mm_loadu_si128((__m128i*)lLoc);
        high = _mm_loadu_si128((__m128i*)hLoc);
        num = _mm_loadu_si128((__m128i*)kLoc);
        // Snip
    }
}
Can this be made better?
I understand there are 8 128-bit XMM registers (plus the MMX registers), whereas I am using just 3 of the available XMM registers. Can I optimize the code to make use of all the registers?

When it comes to optimization, don't guess: measure, and measure on the target/production environment. For simple arithmetic the bottleneck is usually memory bandwidth, so you may consider doing other work between the load and the store. You may also reduce it to one compare instead of two compares plus a merge of the results, by reordering and interleaving the loop into:
num_1 = loadu(kLoc); kLoc += 16;
num_2 = loadu(kLoc); kLoc += 16;
low_1 = loadu(lLoc); lLoc += 16;
low_2 = loadu(lLoc); lLoc += 16;
high_1 = loadu(hLoc); hLoc += 16;
high_2 = loadu(hLoc); hLoc += 16;
num_1 = sub(num_1, low_1)
num_2 = sub(num_2, low_2)
cmp num_1, high_1
cmp num_2, high_2
store num_1
store num_2
You may also want to move kLoc, lLoc and hLoc outside the loop and increment them (i.e. kLoc++) as above; some compilers are dumb enough to generate code that recalculates the address on every iteration.
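As a rough sketch of the hoisted-pointer, interleaved two-rows-at-a-time idea (the inner ii loop, the in-range test and the popcount-based counting are my assumptions, since the original code is snipped right after the loads; the popcount step needs SSE4.2's POPCNT):

#include <emmintrin.h>   // SSE2
#include <nmmintrin.h>   // _mm_popcnt_u32 (POPCNT / SSE4.2)

// Count how many of the 256 key pixels fall strictly inside (low, high)
// for one candidate position (i, j). Row pointers are hoisted once per
// row pair and two independent rows are processed per iteration.
static int countInRange(const int8_t low[8192][8192],
                        const int8_t high[8192][8192],
                        const int8_t key[16][16], int i, int j)
{
    int count = 0;
    for (int ii = 0; ii < 16; ii += 2)
    {
        const __m128i k0 = _mm_loadu_si128((const __m128i*)key[ii]);
        const __m128i k1 = _mm_loadu_si128((const __m128i*)key[ii + 1]);
        const __m128i l0 = _mm_loadu_si128((const __m128i*)(low[i + ii] + j));
        const __m128i l1 = _mm_loadu_si128((const __m128i*)(low[i + ii + 1] + j));
        const __m128i h0 = _mm_loadu_si128((const __m128i*)(high[i + ii] + j));
        const __m128i h1 = _mm_loadu_si128((const __m128i*)(high[i + ii + 1] + j));

        // signed byte test: low < key && key < high
        const __m128i ok0 = _mm_and_si128(_mm_cmpgt_epi8(k0, l0), _mm_cmpgt_epi8(h0, k0));
        const __m128i ok1 = _mm_and_si128(_mm_cmpgt_epi8(k1, l1), _mm_cmpgt_epi8(h1, k1));

        // each matching byte contributes one bit to the movemask
        count += _mm_popcnt_u32(_mm_movemask_epi8(ok0));
        count += _mm_popcnt_u32(_mm_movemask_epi8(ok1));
    }
    return count;   // the caller records (i, j) when count > 150
}

The outer i/j loops from the question would then just call this once per candidate position.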


C/C++ fast absolute difference between two series

I am interested in generating efficient C/C++ code to get the differences between two time series.
More precisely: the time series values are stored as uint16_t arrays with a fixed, equal length of 128.
I am fine with either a pure C or a pure C++ implementation. My code examples are in C++.
My intentions are:
Let A, B and C be discrete time series of length l with a value type of uint16_t.
∀ n < l: C[n] = |A[n] - B[n]|
What i can think of in pseudo code:
for index i:
if a[i] > b[i]:
c[i] = a[i] - b[i]
else:
c[i] = b[i] - a[i]
Or in c/c++
for (uint8_t i = 0; i < 128; i++) {
    c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
}
But I really don't like the if/else statement in the loop.
I am okay with looping - this can be unrolled by the compiler.
Somewhat like:
void getBufDiff(const uint16_t (&a)[128], const uint16_t (&b)[128], uint16_t (&c)[128]) {
    #pragma unroll 16
    for (uint8_t i = 0; i < 128; i++) {
        c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
    }
}
What I am looking for is some 'magic code' which speeds up the if/else and gets me the absolute difference between the two unsigned values.
I am okay with +/- 1 precision (in case this would allow some bit magic to happen). I am also okay with changing the data type to get faster results, and with dropping the loop for something else.
So something like:
void getBufDiff(const uint16_t (&a)[128], const uint16_t (&b)[128], uint16_t (&c)[128]) {
    #pragma unroll 16
    for (uint8_t i = 0; i < 128; i++) {
        c[i] = magic_code_for_abs_diff(a[i], b[i]);
    }
}
I did try XORing the two values; it gives the proper result only in one of the two cases.
EDIT 2:
Did a quick test of different approaches on my laptop.
For 250,000,000 entries this is the performance (256 rounds):
c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]; ~500ms
c[i] = std::abs(a[i] - b[i]); ~800ms
c[i] = ((a[i] - b[i]) + ((a[i] - b[i]) >> 15)) ^ ((a[i] - b[i]) >> 15) ~425ms
uint16_t tmp = (a[i] - b[i]); c[i] = tmp * ((tmp > 0) - (tmp < 0)); ~600ms
uint16_t ret[2] = { a[i] - b[i], b[i] - a[i] };c[i] = ret[a[i] < b[i]] ~900ms
c[i] = ((a[i] - b[i]) >> 31 | 1) * (a[i] - b[i]); ~375ms
c[i] = ((a[i] - b[i])) ^ ((a[i] - b[i]) >> 15) ~425ms
Your problem is a good candidate for SIMD. GCC can do it automatically; here is a simplified example: https://godbolt.org/z/36nM8bYYv
void absDiff(const uint16_t* a, const uint16_t* b, uint16_t* __restrict__ c)
{
for (uint8_t i = 0; i < 16; i++)
c[i] = a[i] - b[i];
}
Note that I added __restrict__ to enable autovectorization, otherwise the compiler has to assume your arrays may overlap and it isn't safe to use SIMD (because some writes could change future reads in the loop).
I simplified it to just 16 at a time, and removed the absolute value for the sake of illustration. The generated assembly is:
vld1.16 {q9}, [r0]!
vld1.16 {q11}, [r1]!
vld1.16 {q8}, [r0]
vld1.16 {q10}, [r1]
vsub.i16 q9, q9, q11
vsub.i16 q8, q8, q10
vst1.16 {q9}, [r2]!
vst1.16 {q8}, [r2]
bx lr
That means it loads 8 integers at once from a, then from b, repeats that once, then does 8 subtracts at once, then again, then stores 8 values twice into c. Many fewer instructions than without SIMD.
Of course it requires benchmarking to see if this is actually faster on your system (after you add back the absolute value part, I suggest using your ?: approach which does not defeat autovectorization), but I expect it will be significantly faster.
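For reference, a minimal sketch of the full function with the suggestions above applied (restrict-qualified output plus the ?: form, so the compiler can still autovectorize); the function name is illustrative:

#include <cstdint>

// Absolute difference of two fixed-length series. __restrict__ promises
// the output does not alias the inputs; the ?: select keeps the loop
// autovectorizable.
void absDiff128(const uint16_t* a, const uint16_t* b, uint16_t* __restrict__ c)
{
    for (int i = 0; i < 128; i++)
        c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
}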
Try to express the conditional as a lane-selection pattern the compiler can map to SIMD instructions, like this (pseudocode):
// store a,b to SIMD registers
for(0 to 32)
a[...] = input[...]
b[...] = input2[...]
// single type operation, easily parallelizable
for(0 to 32)
vector1[...] = a[...] - b[...]
// single type operation, easily parallelizable
// maybe better to compute b-a to reduce the dependency on the first step
// since a and b are already in SIMD registers
for(0 to 32)
vector2[...] = -vector1[...]
// single type operation, easily parallelizable
// re-use a,b registers, again
for(0 to 32)
vector3[...] = a[...] < b[...]
// the x86 architecture has SIMD instructions for this
// operation is simple, no other calculations inside, just 3 inputs, 1 out
// all operands are registers (at least should be, if compiler works fine)
for(0 to 32)
vector4[...] = vector3[...] ? vector2[...] : vector1[...]
If you post your benchmark code, I can compare this with the other solutions. But it wouldn't matter for good compilers (or good compiler flags) that do the same thing automatically for the first benchmark snippet in the question.
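For what it's worth, here is a minimal hand-written SSE2 sketch of the same select-free idea, using the saturating-subtract trick (subs(a,b) is zero wherever a < b, so ORing both directions yields |a - b| exactly); the function name and the fixed length of 128 are carried over from the question:

#include <emmintrin.h>   // SSE2
#include <cstdint>

void absDiffSse2(const uint16_t* a, const uint16_t* b, uint16_t* c)
{
    for (int i = 0; i < 128; i += 8)   // 8 x uint16_t per 128-bit vector
    {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        __m128i d1 = _mm_subs_epu16(va, vb);   // a-b where a > b, else 0
        __m128i d2 = _mm_subs_epu16(vb, va);   // b-a where b > a, else 0
        _mm_storeu_si128((__m128i*)(c + i), _mm_or_si128(d1, d2));
    }
}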
Fast abs (for two's-complement integers) can be implemented as (x + (x >> N)) ^ (x >> N), where N is the bit width of the type minus 1, i.e. 15 in your case. That's a possible implementation of std::abs, but you can still try it.
– answer by freakish
Since you write "I am okay with a +/- 1 precision", you can use an XOR solution: instead of abs(x), do x ^ (x >> 15). This will give an off-by-one result for negative values.
If you want the exact result even for negative values, use the other answer (with the + (x >> 15) correction).
In any case, this XOR trick only works if overflow is impossible; that is also why the compiler can't replace abs with XOR-based code on its own.
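Put together, a branchless scalar helper along these lines might look as follows (a sketch; the helper name is made up, the difference is widened to 32 bits so the full uint16_t range is exact, and an arithmetic right shift of negative values is assumed, as on all mainstream compilers):

#include <cstdint>

static inline uint16_t abs_diff_u16(uint16_t a, uint16_t b)
{
    int32_t d = (int32_t)a - (int32_t)b;   // range -65535 .. 65535
    int32_t m = d >> 31;                   // 0 if d >= 0, -1 if d < 0
    return (uint16_t)((d + m) ^ m);        // == |d|
}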

Most efficient way to test a 256-bit YMM AVX register element for equal or less than zero

I'm implementing a particle system using Intel AVX intrinsics. When the Y-position of a particle is less than or equal to zero I want to reset the particle.
The particle system is ordered in a SOA-pattern like this:
class ParticleSystem
{
private:
float* mXPosition;
float* mYPosition;
float* mZPosition;
.... Rest of code not important for this question
The initial approach I had in mind was just to iterate through the mYPosition array and check for the case stated at the beginning. Perhaps some performance improvements could be made to this approach?
The question, however, is whether there is an efficient way to implement this using the AVX intrinsics. Thank you!
If the elements which are <= 0 are relatively sparse then one simple approach is to test 8 at a time using AVX and then drop into scalar code when you identify a vector which contains one or more such elements, e.g.:
#include <immintrin.h> // AVX intrinsics
const __m256 vk0 = _mm256_setzero_ps(); // const vector of zeros
int i = 0;
for (; i + 8 <= n; i += 8)
{
__m256 vy = _mm256_loadu_ps(&mYPosition[i]); // load 8 x floats
__m256 vcmp = _mm256_cmp_ps(vy, vk0, _CMP_LE_OS); // compare for <= 0
int mask = _mm256_movemask_ps(vcmp); // get MS bits from comparison result
if (mask != 0) // if any bits set
{ // then we have 1 or more elements <= 0
for (int k = 0; k < 8; ++k) // test each element in vector
{ // using scalar code...
if ((mask & 1) != 0)
{
// found element at index i + k
// do something with it...
}
mask >>= 1;
}
}
}
// deal with any remaining elements in case where n is not a multiple of 8
for (int j = i; j < n; ++j)
{
if (mYPosition[j] <= 0.0f)
{
// found element at index j
// do something with it...
}
}
Of course if the matching elements are not sparse, i.e. if you are typically finding one or more in every vector of 8, then this isn't going to buy you any performance gain. However if the elements are sparse, such that most vectors can be skipped, then you should see a significant benefit.
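If the matches turn out not to be sparse, a branch-free alternative is to blend a reset value into every lane whose Y position is <= 0 and store the whole vector back. A rough sketch (the function name, the single reset value and resetting only the Y array are assumptions; a real reset would presumably touch the other SoA arrays as well):

#include <immintrin.h> // AVX intrinsics

void resetDeadParticles(float* mYPosition, int n, float resetY)
{
    const __m256 vk0    = _mm256_setzero_ps();
    const __m256 vreset = _mm256_set1_ps(resetY);
    int i = 0;
    for (; i + 8 <= n; i += 8)
    {
        __m256 vy   = _mm256_loadu_ps(&mYPosition[i]);
        __m256 vcmp = _mm256_cmp_ps(vy, vk0, _CMP_LE_OS);  // lanes with y <= 0
        __m256 vnew = _mm256_blendv_ps(vy, vreset, vcmp);  // keep or reset per lane
        _mm256_storeu_ps(&mYPosition[i], vnew);
    }
    for (; i < n; ++i)                     // scalar tail, as in the loop above
        if (mYPosition[i] <= 0.0f)
            mYPosition[i] = resetY;
}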

AVX2 Winner-Take-All Disparity Search

I am optimizing the "winner-take-all" portion of a disparity estimation algorithm using AVX2. My scalar routine is accurate, but at QVGA resolution and 48 disparities the runtime is disappointingly slow at ~14 ms on my laptop. I create both LR and RL disparity images, but for simplicity here I will only include code for the RL search.
My scalar routine:
int MAXCOST = 32000;
for (int i = maskRadius; i < rstep-maskRadius; i++) {
// WTA "RL" Search:
for (int j = maskRadius; j+maskRadius < cstep; j++) {
int minCost = MAXCOST;
int minDisp = 0;
for (int d = 0; d < numDisp && j+d < cstep; d++) {
if (asPtr[(i*numDisp*cstep)+(d*cstep)+j] < minCost) {
minCost = asPtr[(i*numDisp*cstep)+(d*cstep)+j];
minDisp = d;
}
}
dRPtr[(i*cstep)+j] = minDisp;
}
}
My attempt at using AVX2:
int MAXCOST = 32000;
int* dispVals = (int*) _mm_malloc( sizeof(int32_t)*16, 32 );
for (int i = maskRadius; i < rstep-maskRadius; i++) {
// WTA "RL" Search AVX2:
for( int j = 0; j < cstep-16; j+=16) {
__m256i minCosts = _mm256_set1_epi16( MAXCOST );
__m128i loMask = _mm_setzero_si128();
__m128i hiMask = _mm_setzero_si128();
for (int d = 0; d < numDisp && j+d < cstep; d++) {
// Grab 16 costs to compare
__m256i costs = _mm256_loadu_si256((__m256i*) (asPtr[(i*numDisp*cstep)+(d*cstep)+j]));
// Get the new minimums
__m256i newMinCosts = _mm256_min_epu16( minCosts, costs );
// Compare new mins to old to build mask to store minDisps
__m256i mask = _mm256_cmpgt_epi16( minCosts, newMinCosts );
__m128i loMask = _mm256_extracti128_si256( mask, 0 );
__m128i hiMask = _mm256_extracti128_si256( mask, 1 );
// Sign extend to 32bits
__m256i loMask32 = _mm256_cvtepi16_epi32( loMask );
__m256i hiMask32 = _mm256_cvtepi16_epi32( hiMask );
__m256i currentDisp = _mm256_set1_epi32( d );
// store min disps with mask
_mm256_maskstore_epi32( dispVals, loMask32, currentDisp ); // RT error, why?
_mm256_maskstore_epi32( dispVals+8, hiMask32, currentDisp ); // RT error, why?
// Set minCosts to newMinCosts
minCosts = newMinCosts;
}
// Write the WTA minimums one-by-one to the RL disparity image
int index = (i*cstep)+j;
for( int k = 0; k < 16; k++ ) {
dRPtr[index+k] = dispVals[k];
}
}
}
_mm_free( dispVals );
The Disparity Space Image (DSI) is of size HxWxD (320x240x48), which I lay out horizontally for better memory accesses, such that each row is of size WxD.
The Disparity Space Image has per-pixel matching costs. This is aggregated with a simple box filter to make another image of the exact same size, but with costs summed over, say, a 3x3 or 5x5 window. This smoothing makes the result more 'robust'. When I am accessing with asPtr, I am indexing into this aggregated-costs image.
Also, in an effort to save on unnecessary computation, I have been starting and ending on rows offset by a mask radius. This mask radius is the radius of my census mask. I could be doing some fancy border reflection, but it is simpler and faster just to not bother with the disparity for this border. This of course applies to the beginning and ending columns too, but messing with the indexing here is not good when I am forcing my entire algorithm to run only on images whose columns are a multiple of 16 (e.g. QVGA: 320x240), so that I can index simply and hit everything with SIMD (no residual scalar processing).
Also, if you think my code is a mess, I encourage you to check out the highly optimized OpenCV stereo algorithms. I find them impossible to follow and have been able to make little to no use of them.
My code compiles but fails at runtime. I am using VS 2012 Express Update 4. When I run with the debugger I am unable to gain any insights. I am relatively new to using intrinsics and so I am not sure what information I should expect to see when debugging, number of registers, whether __m256i variables should be visible, etc.
Heeding comment advice below, I improved the scalar time from ~14 ms to ~8 ms by using smarter indexing. My CPU is an i7-4980HQ and I successfully use AVX2 intrinsics elsewhere in the same file.
I still haven't found the problem, but I did see some things you might want to change. You're not checking the return value of _mm_malloc, for one. If it's failing, that would explain it. (Maybe it doesn't like allocating 32-byte-aligned memory?)
If you're running your code under a memory checker or something, then maybe it doesn't like that dispVals is only ever partially written: the masked stores skip lanes whose mask bit is clear, yet the scalar copy loop afterwards reads all 16 elements.
Run your code under a debugger and find out what's going wrong. "runtime error" is not very meaningful.
_mm_set1* functions are slow-ish. VPBROADCASTD needs its source in memory or a vector reg, not a GP reg, so the compiler can either movd from a GP reg to a vector reg and then broadcast, or store to memory and then broadcast. Anyway, it would be faster to do
const __m256i add1 = _mm256_set1_epi32( 1 );
__m256i dvec = _mm256_setzero_si256();
for (d;d...;d++) {
dvec = _mm256_add_epi32(dvec, add1);
}
Other stuff:
This will probably run faster if you aren't storing to memory every iteration of the inner loop. Use a blend instruction (_mm256_blendv_epi8), or something like that, to update the vector(s) of displacements that go with the min costs. Blend = masked move with a register destination.
Also, your displacement values should fit in 16-bit integers, so don't sign-extend them to 32 bits until AFTER you're done finding them. Intel CPUs can sign-extend a 16-bit memory location into a GP register on the fly with no speed penalty (movsx is as fast as mov), so probably just declare your dRPtr array as uint16_t. Then you don't need the sign-extending stuff in your vector code at all (let alone in your inner loop!). Hopefully _mm256_extracti128_si256(mask, 0) compiles to nothing, since the 128 bits you want are already the low half, so the register can be used directly as the source for vpmovsxwd, but still.
You can also save an instruction (and a fused-domain uop) by not loading first. (unless the compiler is smart enough not to elide the vmovdqu and use vpminuw with a memory operand, even though you used the load intrinsic).
So I'm thinking something like this:
// totally untested, didn't even check that this compiles.
for(i) { for(j) {
// inner loop, compiler can hoist these constants.
const __m256i add1 = _mm256_set1_epi16( 1 );
__m256i dvec = _mm256_setzero_si256();
__m256i minCosts = _mm256_set1_epi16( MAXCOST );
__m256i minDisps = _mm256_setzero_si256();
for (int d=0 ; d < numDisp && j+d < cstep ;
d++, dvec = _mm256_add_epi16(dvec, add1))
{
__m256i newMinCosts = _mm256_min_epu16( minCosts, _mm256_loadu_si256((const __m256i*)&asPtr[(i*numDisp*cstep)+(d*cstep)+j]) );  // load should fold into vpminuw
__m256i mask = _mm256_cmpgt_epi16( minCosts, newMinCosts );
minDisps = _mm256_blendv_epi8(minDisps, dvec, mask); // 2 uops, latency=2
minCosts = newMinCosts;
}
// put sign extension here if making dRPtr uint16_t isn't an option.
int index = (i*cstep)+j;
_mm256_storeu_si256((__m256i*)(dRPtr + index), minDisps);
}}
You might get better performance having two parallel dependency chains: minCosts0 / minDisps0, and minCosts1 / minDisps1, and then combining them at the end. minDisps is a loop-carried dependency, but the loop only has 5 instructions (including the vpadd, which looks like loop overhead but can't be reduced by unrolling). They decode to 6 uops (blendv is 2), plus loop overhead. It should run in 1.5cycles / iteration (not counting loop overhead) on haswell, but the dep chain would limit it to one iteration per 2 cycles. (Assuming unrolling to get rid of loop overhead). Doing two dep chains in parallel fixes this, and has the same effect as unrolling the loop: less loop overhead.
Hmm, actually on Haswell,
pminuw can run on p1/p5. (and the load part on p2/p3)
pcmpgtw can run on p1/p5
vpblendvb is 2 uops for p5.
padduw can run on p1/p5
movdqa reg,reg can run on p0/p1/p5 (and may not need an execution unit at all). Unrolling should get rid of any overhead for minCosts = newMinCosts, since the compiler can just end up with newMinCosts from the last unrolled loop body in the right register for the first loop body of the next iteration.
fused sub / jge (loop counter) can run on p6. (using PTEST + jcc on dvec would be slower). add/sub can run on p0/p1/p5/p6 when not fused with a jcc.
Ok, so actually the loop will take 2.5 cycles per iteration, limited by instructions that can only run on p1/p5. Unrolling by 2 or 4 will reduce the loop / movdqa overhead. Since Haswell can issue 4 uops per clock, it can then more efficiently queue up uops for out-of-order execution, since the loop won't have a super-high number of iterations. (48 was your example.) Having lots of uops queued up will give the CPU something to do after leaving the loop, and hide any latencies from cache misses, etc.
_mm256_min_epu16 (PMINUW) is another loop-carried dependency chain. Using it with a memory operand makes it a 3 or 4-cycle latency. However, the load part of the instruction can start as soon as the address is known, so folding a load into a modify op to take advantage of micro-fusion doesn't make dep chains any longer or shorter than using a separate load.
Sometimes you need to use a separate load, for unaligned data (AVX removed the alignment requirement for memory operands). We're limited more by execution units than the 4 uop / clock issue limit, so it's probably fine to use a dedicated load instruction.
source for insn ports / latencies.
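Putting those pieces together, an untested sketch of the two-dependency-chain version for one 16-wide column group (even disparities feed chain 0, odd disparities chain 1, merged after the loop; parameter names mirror the question, numDisp is assumed even, and costs are assumed to stay below 32768 so the signed 16-bit compare is valid, which holds for MAXCOST = 32000):

#include <immintrin.h>
#include <cstdint>

static __m256i wtaRow16(const uint16_t* asPtr, int cstep, int numDisp, int i, int j)
{
    const __m256i add2 = _mm256_set1_epi16(2);
    __m256i minCosts0 = _mm256_set1_epi16(32000), minDisps0 = _mm256_setzero_si256();
    __m256i minCosts1 = _mm256_set1_epi16(32000), minDisps1 = _mm256_setzero_si256();
    __m256i dvec0 = _mm256_setzero_si256();      // even disparities: 0, 2, 4, ...
    __m256i dvec1 = _mm256_set1_epi16(1);        // odd disparities: 1, 3, 5, ...

    for (int d = 0; d + 1 < numDisp; d += 2)
    {
        __m256i c0 = _mm256_loadu_si256((const __m256i*)&asPtr[(i*numDisp*cstep) + (d    )*cstep + j]);
        __m256i c1 = _mm256_loadu_si256((const __m256i*)&asPtr[(i*numDisp*cstep) + (d + 1)*cstep + j]);
        __m256i n0 = _mm256_min_epu16(minCosts0, c0);
        __m256i n1 = _mm256_min_epu16(minCosts1, c1);
        // where the new min is strictly lower, record this chain's disparity
        minDisps0 = _mm256_blendv_epi8(minDisps0, dvec0, _mm256_cmpgt_epi16(minCosts0, n0));
        minDisps1 = _mm256_blendv_epi8(minDisps1, dvec1, _mm256_cmpgt_epi16(minCosts1, n1));
        minCosts0 = n0;  minCosts1 = n1;
        dvec0 = _mm256_add_epi16(dvec0, add2);
        dvec1 = _mm256_add_epi16(dvec1, add2);
    }
    // merge: take chain 1's disparity wherever it found a strictly lower cost
    __m256i takeChain1 = _mm256_cmpgt_epi16(minCosts0, minCosts1);
    return _mm256_blendv_epi8(minDisps0, minDisps1, takeChain1);
}

The returned vector of 16-bit disparities could then be stored straight into a uint16_t disparity row, as suggested above.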
Before you go and do platform specific optimizations, there are plenty of portable optimizations that could be performed. Extract loop invariants, convert index multiplies to increment additions, etc...
This may not be exact, but gets the general idea across:
int MAXCOST = 32000;
const int rowStride = numDisp * cstep;               // one DSI row
for (int i = maskRadius; i < rstep - maskRadius; i++) {
    const int rowBase = i * rowStride;               // hoisted: i takes the largest stride
    for (int j = maskRadius; j + maskRadius < cstep; j++) {
        int minCost = MAXCOST, minDisp = 0;
        int idx = rowBase + j;                       // d advances by cstep, j by 1
        for (int d = 0; d < numDisp && j + d < cstep; d++, idx += cstep) {
            if (asPtr[idx] < minCost) {
                minCost = asPtr[idx];
                minDisp = d;
            }
        }
        dRPtr[i*cstep + j] = minDisp;
    }
}
Once you have done this it becomes apparent what is actually occurring: "i" takes the largest step, followed by "d", with "j" being the variable that walks over sequential data. ... the next step would be to reorder the loops accordingly and, if you still need further optimizations, apply platform-specific intrinsics.

Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?

After conducting some experiments on square matrices of different sizes, a pattern came up. Invariably, transposing a matrix of size 2^n is slower than transposing one of size 2^n+1. For small values of n, the difference is not major.
Big differences occur, however, from a value of 512 onward (at least for me).
Disclaimer: I know the function doesn't actually transpose the matrix because of the double swap of elements, but it makes no difference.
Follows the code:
#define SAMPLES 1000
#define MATSIZE 512
#include <time.h>
#include <iostream>
int mat[MATSIZE][MATSIZE];
void transpose()
{
for ( int i = 0 ; i < MATSIZE ; i++ )
for ( int j = 0 ; j < MATSIZE ; j++ )
{
int aux = mat[i][j];
mat[i][j] = mat[j][i];
mat[j][i] = aux;
}
}
int main()
{
//initialize matrix
for ( int i = 0 ; i < MATSIZE ; i++ )
for ( int j = 0 ; j < MATSIZE ; j++ )
mat[i][j] = i+j;
int t = clock();
for ( int i = 0 ; i < SAMPLES ; i++ )
transpose();
int elapsed = clock() - t;
std::cout << "Average for a matrix of " << MATSIZE << ": " << elapsed / SAMPLES;
}
Changing MATSIZE lets us alter the size (duh!). I posted two versions on ideone:
size 512 - average 2.46 ms - http://ideone.com/1PV7m
size 513 - average 0.75 ms - http://ideone.com/NShpo
In my environment (MSVS 2010, full optimizations), the difference is similar :
size 512 - average 2.19 ms
size 513 - average 0.57 ms
Why is this happening?
The explanation comes from Agner Fog in Optimizing software in C++ and it reduces to how data is accessed and stored in the cache.
For terms and detailed info, see the wiki entry on caching; I'm going to narrow it down here.
A cache is organized in sets and lines. For any given address only one set is used, and within it any of the lines it contains can be used. The memory a line can mirror, times the number of lines, gives us the cache size.
For a particular memory address, we can calculate which set should mirror it with the formula:
set = ( address / lineSize ) % numberOfsets
This sort of formula ideally gives a uniform distribution across the sets, because each memory address is as likely to be read (I said ideally).
It's clear that overlaps can occur. In case of a cache miss, the memory is read into the cache and an old value is replaced. Remember each set has a number of lines, out of which the least recently used one is overwritten with the newly read memory.
I'll try to somewhat follow the example from Agner:
Assume each set has 4 lines, each holding 64 bytes. We first attempt to read the address 0x2710, which goes in set 28. And then we also attempt to read addresses 0x2F00, 0x3700, 0x3F00 and 0x4700. All of these belong to the same set. Before reading 0x4700, all lines in the set would have been occupied. Reading that memory evicts an existing line in the set, the line that initially was holding 0x2710. The problem lies in the fact that we read addresses that are (for this example) 0x800 apart. This is the critical stride (again, for this example).
The critical stride can also be calculated:
criticalStride = numberOfSets * lineSize
Variables spaced criticalStride or a multiple apart contend for the same cache lines.
This is the theory part. Next, the explanation (also Agner, I'm following it closely to avoid making mistakes):
Assume a matrix of 64x64 (remember, the effects vary according to the cache) with an 8kb cache, 4 lines per set * line size of 64 bytes. Each line can hold 8 of the elements in the matrix (64-bit int).
The critical stride would be 2048 bytes, which corresponds to 4 rows of the matrix (which are contiguous in memory).
Assume we're processing row 28. We're attempting to take the elements of this row and swap them with the elements from column 28. The first 8 elements of the row make up a cache line, but they'll go into 8 different cache lines in column 28. Remember, critical stride is 4 rows apart (4 consecutive elements in a column).
When element 16 is reached in the column (4 cache lines per set & 4 rows apart = trouble) the ex-0 element will be evicted from the cache. When we reach the end of the column, all previous cache lines will have been lost and will need reloading on the next access (the whole line is overwritten).
Having a size that is not a multiple of the critical stride messes up this perfect scenario for disaster, as we're no longer dealing with elements that are critical stride apart on the vertical, so the number of cache reloads is severely reduced.
Another disclaimer - I just got my head around the explanation and hope I nailed it, but I might be mistaken. Anyway, I'm waiting for a response (or confirmation) from Mysticial. :)
As an illustration to the explanation in Luchian Grigore's answer, here's what the matrix cache presence looks like for the two cases of 64x64 and 65x65 matrices (see the link above for details on numbers).
Colors in the animations encode the cache state of each matrix element: not in cache, in cache, cache hit, just read from RAM, and cache miss. (The animations themselves are not reproduced here.)
The 64x64 case:
Notice how almost every access to a new row results in a cache miss. And now how it looks for the normal case, a 65x65 matrix:
Here you can see that most of the accesses after the initial warming-up are cache hits. This is how CPU cache is intended to work in general.
The code that generated frames for the above animations can be seen here.
Luchian gives an explanation of why this behavior happens, but I thought it'd be a nice idea to show one possible solution to this problem and at the same time show a bit about cache oblivious algorithms.
Your algorithm basically does:
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
A[j][i] = A[i][j];
which is just horrible for a modern CPU. One solution is to know the details about your cache system and tweak the algorithm to avoid those problems. That works great as long as you know those details, but it's not especially portable.
Can we do better than that? Yes we can: a general approach to this problem is cache-oblivious algorithms, which, as the name says, avoid depending on specific cache sizes [1].
The solution would look like this:
void recursiveTranspose(int i0, int i1, int j0, int j1) {
int di = i1 - i0, dj = j1 - j0;
const int LEAFSIZE = 32; // well ok caching still affects this one here
if (di >= dj && di > LEAFSIZE) {
int im = (i0 + i1) / 2;
recursiveTranspose(i0, im, j0, j1);
recursiveTranspose(im, i1, j0, j1);
} else if (dj > LEAFSIZE) {
int jm = (j0 + j1) / 2;
recursiveTranspose(i0, i1, j0, jm);
recursiveTranspose(i0, i1, jm, j1);
} else {
for (int i = i0; i < i1; i++ )
for (int j = j0; j < j1; j++ )
mat[j][i] = mat[i][j];
}
}
Slightly more complex, but a short test shows something quite interesting on my ancient E8400 with VS2010 x64 release; test code for MATSIZE 8192:
int main() {
LARGE_INTEGER start, end, freq;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&start);
recursiveTranspose(0, MATSIZE, 0, MATSIZE);
QueryPerformanceCounter(&end);
printf("recursive: %.2fms\n", (end.QuadPart - start.QuadPart) / (double(freq.QuadPart) / 1000));
QueryPerformanceCounter(&start);
transpose();
QueryPerformanceCounter(&end);
printf("iterative: %.2fms\n", (end.QuadPart - start.QuadPart) / (double(freq.QuadPart) / 1000));
return 0;
}
results:
recursive: 480.58ms
iterative: 3678.46ms
Edit: About the influence of size: it is much less pronounced, although still noticeable to some degree. That's because we're using the iterative solution as a leaf node instead of recursing down to 1 (the usual optimization for recursive algorithms). If we set LEAFSIZE = 1, the cache has no influence for me [8193: 1214.06 ms; 8192: 1171.62 ms; 8191: 1351.07 ms]. That's inside the margin of error; the fluctuations are in the 100 ms area, so this "benchmark" isn't something I'd be too comfortable with if we wanted completely accurate values.
[1] Sources for this stuff: well, if you can't get a lecture from someone who worked with Leiserson and co. on this, I assume their papers are a good starting point. Those algorithms are still quite rarely described - CLR has a single footnote about them. Still, it's a great way to surprise people.
Edit (note: I'm not the one who posted this answer; I just wanted to add this):
Here's a complete C++ version of the above code:
template<class InIt, class OutIt>
void transpose(InIt const input, OutIt const output,
size_t const rows, size_t const columns,
size_t const r1 = 0, size_t const c1 = 0,
size_t r2 = ~(size_t) 0, size_t c2 = ~(size_t) 0,
size_t const leaf = 0x20)
{
if (!~c2) { c2 = columns - c1; }
if (!~r2) { r2 = rows - r1; }
size_t const di = r2 - r1, dj = c2 - c1;
if (di >= dj && di > leaf)
{
transpose(input, output, rows, columns, r1, c1, (r1 + r2) / 2, c2);
transpose(input, output, rows, columns, (r1 + r2) / 2, c1, r2, c2);
}
else if (dj > leaf)
{
transpose(input, output, rows, columns, r1, c1, r2, (c1 + c2) / 2);
transpose(input, output, rows, columns, r1, (c1 + c2) / 2, r2, c2);
}
else
{
for (ptrdiff_t i1 = (ptrdiff_t) r1, i2 = (ptrdiff_t) (i1 * columns);
i1 < (ptrdiff_t) r2; ++i1, i2 += (ptrdiff_t) columns)
{
for (ptrdiff_t j1 = (ptrdiff_t) c1, j2 = (ptrdiff_t) (j1 * rows);
j1 < (ptrdiff_t) c2; ++j1, j2 += (ptrdiff_t) rows)
{
output[j2 + i1] = input[i2 + j1];
}
}
}
}
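A hypothetical usage of the template above (the sizes and the int element type are arbitrary; with the default r2/c2/leaf arguments it transposes the whole rows x columns matrix, so out[c * rows + r] ends up equal to in[r * columns + c]):

#include <cstddef>
#include <vector>

int main()
{
    const size_t rows = 1024, columns = 768;            // example sizes
    std::vector<int> in(rows * columns), out(rows * columns);
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < columns; ++c)
            in[r * columns + c] = (int)(r * columns + c);

    // recursive cache-oblivious transpose of the whole matrix
    transpose(in.begin(), out.begin(), rows, columns);
    return 0;
}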

Optimizing this code block

for (int i = 0; i < 5000; i++)
for (int j = 0; j < 5000; j++)
{
for (int ii = 0; ii < 20; ii++)
for (int jj = 0; jj < 20; jj++)
{
int num = matBigger[i+ii][j+jj];
// Extract range from this.
int low = num & 0xff;
int high = num >> 8;
if (low < matSmaller[ii][jj] && matSmaller[ii][jj] > high)
{
// match found
}
}
}
The machine is x86_64, with a 32 KB L1 cache and a 256 KB L2 cache.
Any pointers on how I can optimize this code?
EDIT Some background to the original problem : Fastest way to Find a m x n submatrix in M X N matrix
First thing I'd try is to move the ii and jj loops outside the i and j loops. That way you're using the same elements of matSmaller for 25 million iterations of the i and j loops, meaning that you (or the compiler if you're lucky) can hoist the access to them outside those loops:
for (int ii = 0; ii < 20; ii++)
    for (int jj = 0; jj < 20; jj++)
    {
        int smaller = matSmaller[ii][jj];
        for (int i = 0; i < 5000; i++)
            for (int j = 0; j < 5000; j++) {
                int num = matBigger[i+ii][j+jj];
                int low = num & 0xff;
                if (low < smaller && smaller > (num >> 8)) {
                    // match found
                }
            }
    }
This might be faster (thanks to less access to the matSmaller array), or it might be slower (because I've changed the pattern of access to the matBigger array, and it's possible that I've made it less cache-friendly). A similar alternative would be to move the ii loop outside i and j and hoist matSmaller[ii], but leave the jj loop inside. The rule of thumb is that it's more cache-friendly to increment the last index of a multi-dimensional array in your inner loops, than earlier indexes. So we're "happier" to modify jj and j than we are to modify ii and i.
Second thing I'd try - what's the type of matBigger? Looks like the values in it are only 16 bits, so try it both as int and as (u)int16_t. The former might be faster because aligned int access is fast. The latter might be faster because more of the array fits in cache at any one time.
There are some higher-level things you could consider with some early analysis of smaller: for example, if it's 0 then you needn't examine matBigger for that value of ii and jj, because (num & 0xff) < 0 is always false.
To do better than "guess things and see whether they're faster or not" you need to know for starters which line is hottest, which means you need a profiler.
Some basic advice:
Profile it, so you can learn where the hot-spots are.
Think about cache locality, and the addresses resulting from your loop order.
Use more const in the innermost scope, to hint more to the compiler.
Try breaking it up so you don't compute high if the low test is failing.
Try maintaining the offset into matBigger and matSmaller explicitly, to the innermost stepping into a simple increment.
The best thing is to understand what the code is supposed to do, then check whether another algorithm exists for this problem.
Apart from that:
If you are just interested in whether a matching entry exists, make sure to break out of all the loops at the position of // match found.
Make sure the data is stored in an optimal way. It all depends on your problem, but e.g. it could be more efficient to have just one flat array of size 5000*5000*20 and overload operator()(int,int,int) for accessing elements (a minimal sketch follows).
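A minimal sketch of that flat-storage idea (the class name and the use of std::vector are illustrative assumptions, not taken from the original code):

#include <cstddef>
#include <vector>

struct Flat3D {
    std::vector<int> data;
    size_t d1, d2, d3;
    Flat3D(size_t n1, size_t n2, size_t n3)
        : data(n1 * n2 * n3), d1(n1), d2(n2), d3(n3) {}
    // row-major, contiguous layout: (i, j, k) -> (i*d2 + j)*d3 + k
    int& operator()(size_t i, size_t j, size_t k)       { return data[(i * d2 + j) * d3 + k]; }
    int  operator()(size_t i, size_t j, size_t k) const { return data[(i * d2 + j) * d3 + k]; }
};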
What are matSmaller and matBigger?
Try changing the access to matBigger[(i+ii) * COL_COUNT + (j+jj)]
I agree with Steve about rearranging your loops to have the higher count as the inner loop. Since your code is only doing loads and compares, I believe a significant portion of the time is used for pointer arithmetic. Try an experiment to change Steve's answer into this:
for (int ii = 0; ii < 20; ii++)
{
    for (int jj = 0; jj < 20; jj++)
    {
        int smaller = matSmaller[ii][jj];
        for (int i = 0; i < 5000; i++)
        {
            int *pI = &matBigger[i+ii][jj];
            for (int j = 0; j < 5000; j++)
            {
                int num = *pI++;
                int low = num & 0xff;
                if (low < smaller && smaller > (num >> 8)) {
                    // match found
                }
            } // for j
        } // for i
    } // for jj
} // for ii
Even in 64-bit mode, the C compiler doesn't necessarily do a great job of keeping everything in register. By changing the array access to be a simple pointer increment, you'll make the compiler's job easier to produce efficient code.
Edit: I just noticed #unwind suggested basically the same thing. Another issue to consider is the statistics of your comparison. Is the low or high comparison more probable? Arrange the conditional statement so that the less probable test is first.
Looks like there is a lot of repetition here. One optimization is to reduce the amount of duplicate effort. Using pen and paper, I'm showing the matBigger "i" index iterating as:
[0 + 0], [0 + 1], [0 + 2], ..., [0 + 19],
[1 + 0], [1 + 1], ..., [1 + 18], [1 + 19]
[2 + 0], ..., [2 + 17], [2 + 18], [2 + 19]
As you can see there are locations that are accessed many times.
Also, multiplying the iteration counts indicates how often the inner content is accessed: 20 * 20 * 5000 * 5000 = 10,000,000,000 (1E+10) times. That's a lot!
So rather than trying to speed up the execution of those 1E+10 iterations (such as by instruction (pipeline) cache or data cache optimization), try reducing the number of iterations.
The code is searching the matrix for a number that is within a range: larger than a minimal value and less than the maximum range value.
Based on this, try a different approach:
Find and remember all coordinates where the search value is greater than the low value; let us call these anchor points.
For each anchor point, find the coordinates of the first value after the anchor point that is outside the range.
The objective is to reduce the number of duplicate accesses. Anchor points allow for a one pass scan and allow other decisions such as finding a range or determining an MxN matrix that contains the anchor value.
Another idea is to create new data structures containing the matBigger and matSmaller that are more optimized for searching.
For example, create a {value, coordinate list} entry for each unique value in matSmaller:
Value coordinate list
26 -> (2,3), (6,5), ..., (1007, 75)
31 -> (4,7), (2634, 5), ...
Now you can use this data structure to find values in matSmaller and immediately know their locations. So you could search matBigger for each unique value in this data structure. This again reduces the number of accesses to the matrices.
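A rough sketch of building such a {value -> coordinate list} index from matSmaller (the 20x20 size follows the question; the int element type and the container choice are assumptions):

#include <map>
#include <utility>
#include <vector>

using CoordList = std::vector<std::pair<int, int>>;

std::map<int, CoordList> buildValueIndex(const int matSmaller[20][20])
{
    std::map<int, CoordList> index;
    for (int ii = 0; ii < 20; ++ii)
        for (int jj = 0; jj < 20; ++jj)
            index[matSmaller[ii][jj]].push_back({ii, jj});   // value -> every (row, col) where it occurs
    return index;
}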