Cache, row major and column major - c++

I've been measuring how long it takes to sum the elements of a matrix in row major order
std::vector<double> v( n * n );
// Timing begins
double sum{ 0.0 };
for (std::size_t i = 0; i < n; i++) {
for (std::size_t j = 0; j < n; j++) {
sum += v[i * n + j];
}
}
// Timing ends
and in column major order
std::vector<double> v( n * n );
// Timing begins
double sum{ 0.0 };
for (std::size_t j = 0; j < n; j++) {
for (std::size_t i = 0; i < n; i++) {
sum += v[i * n + j];
}
}
// Timing ends
The code has been compiled with
g++ -std=c++11 -Ofast -fno-tree-vectorize -DNDEBUG main.cpp -o main
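For readers who do not want to download the archive, here is a minimal sketch of how each traversal could be timed. It assumes `std::chrono::steady_clock` as the clock; the helper name `nanosPerByte` and the `sink` parameter are invented for this illustration and are not taken from the linked code.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Sums an n x n matrix stored row by row, walking memory contiguously.
double sumRowMajor(const std::vector<double>& v, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            sum += v[i * n + j];
    return sum;
}

// Same sum, but walking down columns: consecutive reads are n doubles apart.
double sumColumnMajor(const std::vector<double>& v, std::size_t n) {
    double sum = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            sum += v[i * n + j];
    return sum;
}

// Hypothetical helper: nanoseconds per byte of v for one traversal.
// The result is added to `sink` so the compiler cannot discard the loop.
template <class F>
double nanosPerByte(F traverse, const std::vector<double>& v,
                    std::size_t n, double& sink) {
    const auto start = std::chrono::steady_clock::now();
    sink += traverse(v, n);
    const auto stop = std::chrono::steady_clock::now();
    const double ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return ns / (static_cast<double>(v.size()) * sizeof(double));
}
```

A call such as `nanosPerByte(sumColumnMajor, v, n, sink)` gives one point of the plot; a single run is noisy, so repeating it and keeping the minimum is probably wise.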
We expect the timings of the row major order (blue) to be significantly faster than the column major order (yellow). If I plot the time it takes to run this algorithm (in nanoseconds) divided by the size in bytes of the array, I get the following graph on my computer which has a core-i7.
The x-axis displays n, and the y-axis displays the time in nanoseconds for the summation divided by the size (in bytes) of v. Everything seems normal. The huge difference between the two starts around n = 850, for which the size of the matrix is about 6 MB, which is exactly the size of my L3 cache. The column major order is 10 times slower than the row major order for large n. I am pleased with the results.
Next thing I do is run the same program on Amazon Web Services where they have an E5-2670. Here are the results of the same program.
The column major order is about 10 times slower than the row major order for 700 <= n <= 2000, but for n >= 2100, the cost per byte of the column major order suddenly drops and it is just 2 times slower than the row major order! Does anyone have an explanation for this strange behaviour?
PS: For those who are interested, the full code is available here: https://www.dropbox.com/s/778hwpuriwqbi6o/InsideLoop.zip?dl=0


How to obtain performance enhancement while multiplying two sub-matrices?

I've got a program multiplying two sub-matrices residing in the same container matrix. I'm trying to obtain some performance gain by using the OpenMP API for parallelization. Below is the multiplication algorithm I use.
#pragma omp parallel for
for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
}
}
}
The algorithm accesses the elements of both input sub-matrices row-wise to enhance cache usage with the spatial locality.
What other OpenMP directives can be used to obtain better performance from that simple algorithm? Is there any other directive for optimizing the operations on the overlapping areas of two sub-matrices?
You can assume that all the sub-matrices have the same size and they are square-shaped. The resulting sub-matrix resides in another container matrix.
For the matrix-matrix product, any permutation of the i,j,k indices computes the right result when run sequentially. In parallel, that is not so. In your original code the k iterations do not write to unique locations, so you cannot just collapse the outer two loops. Do a k,j interchange and then it is allowed.
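To make the interchange concrete, here is a minimal sketch. It assumes flat row-major `std::vector<double>` storage instead of the question's matrix class, and the function name `multiplyCollapsed` is made up for the example. With i and j outermost, every (i, j) pair writes exactly one output element, so `collapse(2)` is safe.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical flat row-major storage: element (r, c) lives at m[r * n + c].
void multiplyCollapsed(const std::vector<double>& a,
                       const std::vector<double>& b,
                       std::vector<double>& result, std::size_t n) {
    // After the k,j interchange each (i, j) pair owns one output element,
    // so the two outer loops can be collapsed without a data race.
    #pragma omp parallel for collapse(2)
    for (std::size_t i = 0; i < n; i++) {
        for (std::size_t j = 0; j < n; j++) {
            double acc = 0.0;  // private to this (i, j) iteration
            for (std::size_t k = 0; k < n; k++) {
                acc += a[i * n + k] * b[k * n + j];
            }
            result[i * n + j] += acc;
        }
    }
}
```

Build with `-fopenmp` to enable the pragma; without it the code still compiles and runs serially. Note that the inner loop now walks `b` with stride n, so this trades cache friendliness for parallel granularity.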
Of course OpenMP gets you from 5 percent efficiency on one core to 5 percent on all cores. You really want to block the loops. But that is a lot harder. See the paper by Goto and van de Geijn.
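As a rough illustration of what blocking the loops means, here is a simple tiled version. It is far simpler than the scheme in the Goto and van de Geijn paper; the flat row-major storage, the name `multiplyBlocked`, and the default tile size of 64 are all assumptions for the sketch, not tuned values.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Tiled i-k-j product on flat row-major matrices (hypothetical layout).
// Each tile of a, b and result is reused while it is still in cache.
void multiplyBlocked(const std::vector<double>& a,
                     const std::vector<double>& b,
                     std::vector<double>& result, std::size_t n,
                     std::size_t tile = 64) {
    for (std::size_t ii = 0; ii < n; ii += tile) {
        const std::size_t iEnd = std::min(ii + tile, n);
        for (std::size_t kk = 0; kk < n; kk += tile) {
            const std::size_t kEnd = std::min(kk + tile, n);
            for (std::size_t jj = 0; jj < n; jj += tile) {
                const std::size_t jEnd = std::min(jj + tile, n);
                // Plain i-k-j product restricted to the current tiles.
                for (std::size_t i = ii; i < iEnd; i++) {
                    for (std::size_t k = kk; k < kEnd; k++) {
                        const double aik = a[i * n + k];
                        for (std::size_t j = jj; j < jEnd; j++) {
                            result[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

Parallelizing the outermost `ii` loop is safe here, because different `ii` tiles write disjoint rows of the result.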
I'm adding something related to the main matrix. Do you use this code to multiply two bigger matrices? Then some of the sub-matrices are re-used between different iterations and are likely to benefit from the CPU cache. For example, if a matrix is split into 4 sub-matrices, then each sub-matrix is used twice to compute the result matrix.
To benefit from cache most, the re-used data should be kept in the cache of the same thread (core). To do this, maybe it is better to move the work-distribution level to the place where you select two submatrices.
So, something like this:
select sub-matrix A
#pragma omp parallel for
select sub-matrix B
for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
}
}
}
could work faster since all the data always stays in the same thread (core).

Trying to understand this solution

I was trying to solve a problem and ran into a few obstacles I couldn't get past. The problem is Codeforces - 817D.
I first tried brute force: compute the minimum and maximum of every subsegment of the array, subtract them, and add up the differences to get the final imbalance. This looked fine, but it exceeded the time limit, because there are n*(n+1)/2 subsegments and n can be up to 10^6. After a couple of hours without any new ideas I looked at a solution, which, to be honest, I could not understand at all. Here it is:
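#include <bits/stdc++.h>
using namespace std;
#define ll long long
const int INF = INT_MAX; // assumed sentinel; the quoted solution does not show how INF is defined
int a[1000000], l[1000000], r[1000000];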
int main(void) {
int i, j, n;
scanf("%d",&n);
for(i = 0; i < n; i++) scanf("%d",&a[i]);
ll ans = 0;
for(j = 0; j < 2; j++) {
vector<pair<int,int>> v;
v.push_back({-1,INF});
for(i = 0; i < n; i++) {
while (v.back().second <= a[i]) v.pop_back();
l[i] = v.back().first;
v.push_back({i,a[i]});
}
v.clear();
v.push_back({n,INF});
for(i = n-1; i >= 0; i--) {
while (v.back().second < a[i]) v.pop_back();
r[i] = v.back().first;
v.push_back({i,a[i]});
}
for(i = 0; i < n; i++) ans += (ll) a[i] * (i-l[i]) * (r[i]-i);
for(i = 0; i < n; i++) a[i] *= -1;
}
cout << ans;
}
I tried tracing it, but I keep wondering why the vector is used. My only idea is that it serves as a stack, since the two behave almost the same, but I don't see why a stack is needed here at all. The expression ans += (ll) a[i] * (i-l[i]) * (r[i]-i); also confuses me, because I don't understand where it comes from.
Well, that's a beast of a calculation. I must confess that I don't understand it completely either. The problem with the brute-force solution is that you have to calculate the same values over and over again.
In a slightly modified example, you calculate the following values for an input of 2, 4, 1 (i reordered it by "distance")
[2, *, *] (from index 0 to index 0), imbalance value is 0; i_min = 0, i_max = 0
[*, 4, *] (from index 1 to index 1), imbalance value is 0; i_min = 1, i_max = 1
[*, *, 1] (from index 2 to index 2), imbalance value is 0; i_min = 2, i_max = 2
[2, 4, *] (from index 0 to index 1), imbalance value is 2; i_min = 0, i_max = 1
[*, 4, 1] (from index 1 to index 2), imbalance value is 3; i_min = 2, i_max = 1
[2, 4, 1] (from index 0 to index 2), imbalance value is 3; i_min = 2, i_max = 1
where i_min and i_max are the indices of the element with the minimum and maximum value.
For a better visual understanding, i wrote the complete array, but hid the unused values with *
So in the last case [2, 4, 1], brute-force looks for the minimum value over all values, which is not necessary, because you already calculated the values for a sub-space of the problem, by calculating [2,4] and [4,1]. But comparing only the values is not enough, you also need to keep track of the indices of the minimum and maximum element, because those can be reused in the next step, when calculating [2, 4, 1].
The idea behind this is a concept called dynamic programming, where results from a calculation are stored to be used again. As so often, you have to choose between speed and memory consumption.
So to come back to your question, here is what i understood :
the arrays l and r store, for each position i, the index of the nearest element to the left that is greater than a[i] and the index of the nearest element to the right that is at least a[i] (-1 and n when no such element exists)
vector v is used as a stack to find the last number (and its index) that is greater than the current one (a[i]). It keeps the candidates seen so far in decreasing order, e.g. for the input 5,3,4 the 5 is stored first, then the 3, and when the 4 comes, the 3 is popped but the index of the 5 is still needed (to be stored in l[2])
then there is this fancy calculation (ans += (ll) a[i] * (i-l[i]) * (r[i]-i)). The stored indices of the maximum (and in the second run the minimum) elements are calculated together with the value a[i] which does not make much sense for me by now, but seems to work (sorry).
at last, all values in the array a are multiplied by -1, which means, the old maximums are now the minimums, and the calculation is done again (2nd run of the outer for-loop over j)
This last step (multiply a by -1) and the outer for-loop over j are not necessary but it's an elegant way to reuse the code.
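For what it's worth, the formula can be read as a counting argument: a[i] is the maximum of every subsegment that starts somewhere in l[i]+1..i and ends somewhere in i..r[i]-1. There are (i-l[i]) possible starts and (r[i]-i) possible ends, so a[i] contributes a[i] * (i-l[i]) * (r[i]-i) to the sum of all maxima. The <= on one side and < on the other make sure that equal values are not counted twice. Below is a minimal sketch of that idea with the sentinel replaced by explicit empty-stack checks; the function names `sumOfMaxima` and `imbalance` are made up for the illustration.

```cpp
#include <vector>

// Sum of the maximum over all contiguous subsegments of a.
long long sumOfMaxima(const std::vector<int>& a) {
    const int n = static_cast<int>(a.size());
    std::vector<int> l(n), r(n), st;  // st holds indices, values decreasing
    for (int i = 0; i < n; i++) {
        // Pop everything that cannot be "the nearest greater to the left".
        while (!st.empty() && a[st.back()] <= a[i]) st.pop_back();
        l[i] = st.empty() ? -1 : st.back();
        st.push_back(i);
    }
    st.clear();
    for (int i = n - 1; i >= 0; i--) {
        // Strict comparison here, so ties are attributed to one side only.
        while (!st.empty() && a[st.back()] < a[i]) st.pop_back();
        r[i] = st.empty() ? n : st.back();
        st.push_back(i);
    }
    long long total = 0;
    for (int i = 0; i < n; i++) {
        // (i - l[i]) choices of start times (r[i] - i) choices of end.
        total += static_cast<long long>(a[i]) * (i - l[i]) * (r[i] - i);
    }
    return total;
}

// Imbalance = sum of maxima - sum of minima over all subsegments.
long long imbalance(std::vector<int> a) {
    const long long maxima = sumOfMaxima(a);
    for (int& x : a) x = -x;  // maxima of -a are the negated minima of a
    return maxima + sumOfMaxima(a);
}
```

For the 2, 4, 1 example above this gives 8, which matches the sum of the six imbalance values listed (0 + 0 + 0 + 2 + 3 + 3).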
Hope this helps a bit.

Implementing iterative autocorrelation process in C++ using for loops

I am implementing pitch tracking using an autocorrelation method in C++ but I am struggling to write the actual line of code which performs the autocorrelation.
I have an array containing a certain number ('values') of amplitude values of a pre-recorded signal, and I am performing the autocorrelation function on a set number (N) of these values.
In order to perform the autocorrelation I have taken the original array and reversed it so that point 0 = point N, point 1 = point N-1 etc, this array is called revarray
Here is what I want to do mathematically:
(array[0] * revarray[0])
(array[0] * revarray[1]) + (array[1] * revarray[0])
(array[0] * revarray[2]) + (array[1] * revarray[1]) + (array[2] * revarray[0])
(array[0] * revarray[3]) + (array[1] * revarray[2]) + (array[2] * revarray[1]) + (array[3] * revarray[0])
...and so on. This will be repeated for array[900]->array[1799] etc until autocorrelation has been performed on all of the samples in the array.
The number of times the autocorrelation is carried out is:
values / N = measurements
Here is the relevant section of my code so far
for (k = 0; k = measurements; ++k){
for (i = k*(N - 1), j = k*N; i >= 0; i--, j++){
revarray[j] = array[i];
for (a = k*N; a = k*(N - 1); ++a){
autocor[a]=0;
for (b = k*N; b = k*(N - 1); ++b){
autocor[a] += //**Here is where I'm confused**//
}
}
}
}
I know that I want to keep iteratively adding new values to autocor[a], but my problem is that the value that needs to be added to will keep changing. I've tried using an increasing count like so:
for (i = (k*N); i = k*(N-1); ++i){
autocor[i] += array[i] * revarray[i-1]
}
But I clearly know this won't work as when the new value is added to the previous autocor[i] this previous value will be incorrect, and when i=0 it will be impossible to calculate using revarray[i-1]
Any suggestions? Been struggling with this for a while now. I managed to get it working on just a single array (not taking N samples at a time) as seen here but I think using the inverted array is a much more efficient approach, I'm just struggling to implement the autocorrelation by taking sections of the entire signal.
It is not very clear to me, but I'll assume that you need to perform your iterations as many times as there are elements in that array (if it is indeed only half that much - adjust the code accordingly).
Also the N is assumed to mean the size of the array, so the index of the last element is N-1.
The loops would look like this:
for(size_t i = 0; i < N; ++i){
autocorr[i] = 0;
for(size_t j = 0; j <= i; ++j){
const size_t idxA = j
, idxR = i - j; // direct and reverse indices in the array
autocorr[i] += array[idxA] * array[idxR];
}
}
Basically you run the outer loop as many times as there are elements in your array and for each of those iterations you run a shorter loop up to the current last index of the outer array.
All that is left to be done now is to properly calculate the indices of the array and revarray to perform the calculations and accumulate a running sum in the current outer loop's index.
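One way the index calculation could be filled in is sketched below, under these assumptions: the signal is held in a `std::vector<double>`, each block of N samples starts at `offset`, and revarray[m] is simply array[N - 1 - m] within that block, so no separate reversed copy is needed. The function name `blockAutocorrelation` is invented for the example.

```cpp
#include <cstddef>
#include <vector>

// Computes, for one block of N samples starting at `offset`, the sums
//   autocor[i] = array[0]*revarray[i] + ... + array[i]*revarray[0]
// where revarray is the reversed block. Requires offset + N <= signal.size().
std::vector<double> blockAutocorrelation(const std::vector<double>& signal,
                                         std::size_t offset, std::size_t N) {
    std::vector<double> autocor(N, 0.0);
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j <= i; ++j) {
            // revarray[i - j] is the block element at index N - 1 - (i - j).
            autocor[i] += signal[offset + j]
                        * signal[offset + N - 1 - (i - j)];
        }
    }
    return autocor;
}
```

Calling it with offset = k * N for k = 0 .. measurements - 1 processes the signal block by block. With this indexing, autocor[i] is the correlation of the block with itself at lag N - 1 - i, so the last entry is the zero-lag energy.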

Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?

After conducting some experiments on square matrices of different sizes, a pattern came up. Invariably, transposing a matrix of size 2^n is slower than transposing one of size 2^n+1. For small values of n, the difference is not major.
Big differences occur however over a value of 512. (at least for me)
Disclaimer: I know the function doesn't actually transpose the matrix because of the double swap of elements, but it makes no difference.
Follows the code:
#define SAMPLES 1000
#define MATSIZE 512
#include <time.h>
#include <iostream>
int mat[MATSIZE][MATSIZE];
void transpose()
{
for ( int i = 0 ; i < MATSIZE ; i++ )
for ( int j = 0 ; j < MATSIZE ; j++ )
{
int aux = mat[i][j];
mat[i][j] = mat[j][i];
mat[j][i] = aux;
}
}
int main()
{
//initialize matrix
for ( int i = 0 ; i < MATSIZE ; i++ )
for ( int j = 0 ; j < MATSIZE ; j++ )
mat[i][j] = i+j;
int t = clock();
for ( int i = 0 ; i < SAMPLES ; i++ )
transpose();
int elapsed = clock() - t;
std::cout << "Average for a matrix of " << MATSIZE << ": " << elapsed / SAMPLES;
}
Changing MATSIZE lets us alter the size (duh!). I posted two versions on ideone:
size 512 - average 2.46 ms - http://ideone.com/1PV7m
size 513 - average 0.75 ms - http://ideone.com/NShpo
In my environment (MSVS 2010, full optimizations), the difference is similar :
size 512 - average 2.19 ms
size 513 - average 0.57 ms
Why is this happening?
The explanation comes from Agner Fog in Optimizing software in C++ and it reduces to how data is accessed and stored in the cache.
For terms and detailed info, see the wiki entry on caching, I'm gonna narrow it down here.
A cache is organized in sets and lines. At a time, only one set is used, out of which any of the lines it contains can be used. The memory a line can mirror times the number of lines gives us the cache size.
For a particular memory address, we can calculate which set should mirror it with the formula:
set = ( address / lineSize ) % numberOfsets
This sort of formula ideally gives a uniform distribution across the sets, because each memory address is as likely to be read (I said ideally).
It's clear that overlaps can occur. In case of a cache miss, the memory is read in the cache and the old value is replaced. Remember each set has a number of lines, out of which the least recently used one is overwritten with the newly read memory.
I'll try to somewhat follow the example from Agner:
Assume each set has 4 lines, each holding 64 bytes. We first attempt to read the address 0x2710, which goes in set 28. And then we also attempt to read addresses 0x2F00, 0x3700, 0x3F00 and 0x4700. All of these belong to the same set. Before reading 0x4700, all lines in the set would have been occupied. Reading that memory evicts an existing line in the set, the line that initially was holding 0x2710. The problem lies in the fact that we read addresses that are (for this example) 0x800 apart. This is the critical stride (again, for this example).
The critical stride can also be calculated:
criticalStride = numberOfSets * lineSize
Variables spaced criticalStride or a multiple apart contend for the same cache lines.
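The two formulas are easy to check numerically. The sketch below assumes a cache with 32 sets and 64-byte lines, which is what an 8 KB cache with 4 lines per set works out to; the function names are made up for the illustration.

```cpp
#include <cstddef>
#include <cstdint>

// set = (address / lineSize) % numberOfSets
std::size_t cacheSet(std::uintptr_t address, std::size_t lineSize,
                     std::size_t numberOfSets) {
    return static_cast<std::size_t>((address / lineSize) % numberOfSets);
}

// criticalStride = numberOfSets * lineSize
std::size_t criticalStride(std::size_t numberOfSets, std::size_t lineSize) {
    return numberOfSets * lineSize;
}
```

With those assumed parameters, `cacheSet(0x2710, 64, 32)` and `cacheSet(0x2F00, 64, 32)` both give 28, and `criticalStride(32, 64)` gives 2048, i.e. 0x800.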
This is the theory part. Next, the explanation (also Agner, I'm following it closely to avoid making mistakes):
Assume a matrix of 64x64 (remember, the effects vary according to the cache) with an 8kb cache, 4 lines per set * line size of 64 bytes. Each line can hold 8 of the elements in the matrix (64-bit int).
The critical stride would be 2048 bytes, which corresponds to 4 rows of the matrix (which is contiguous in memory).
Assume we're processing row 28. We're attempting to take the elements of this row and swap them with the elements from column 28. The first 8 elements of the row make up a cache line, but they'll go into 8 different cache lines in column 28. Remember, critical stride is 4 rows apart (4 consecutive elements in a column).
When element 16 is reached in the column (4 cache lines per set & 4 rows apart = trouble) the ex-0 element will be evicted from the cache. When we reach the end of the column, all previous cache lines would have been lost and needed reloading on access to the next element (the whole line is overwritten).
Having a size that is not a multiple of the critical stride messes up this perfect scenario for disaster, as we're no longer dealing with elements that are critical stride apart on the vertical, so the number of cache reloads is severely reduced.
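One common way to exploit this, if the logical size has to stay a power of two, is to pad each row so that the distance between rows is no longer a multiple of the critical stride. The sketch below is only an illustration of that idea: the `PaddedMatrix` type and the choice of one padding element are assumptions, and whether it helps on a given machine has to be measured.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical wrapper: a logical n x n matrix whose rows are stored
// `stride` ints apart, e.g. n = 512 with stride = 513.
struct PaddedMatrix {
    std::size_t n;
    std::size_t stride;
    std::vector<int> data;

    PaddedMatrix(std::size_t size, std::size_t pad)
        : n(size), stride(size + pad), data(size * (size + pad)) {}

    int& at(std::size_t i, std::size_t j) { return data[i * stride + j]; }
};

// In-place transpose: swaps each pair above the diagonal exactly once.
void transposeInPlace(PaddedMatrix& m) {
    for (std::size_t i = 0; i < m.n; i++) {
        for (std::size_t j = i + 1; j < m.n; j++) {
            std::swap(m.at(i, j), m.at(j, i));
        }
    }
}
```

`PaddedMatrix m(512, 1)` then has the same memory layout per row as the 513 case, while still exposing only 512 x 512 elements.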
Another disclaimer - I just got my head around the explanation and hope I nailed it, but I might be mistaken. Anyway, I'm waiting for a response (or confirmation) from Mysticial. :)
As an illustration to the explanation in Luchian Grigore's answer, here's what the matrix cache presence looks like for the two cases of 64x64 and 65x65 matrices (see the link above for details on numbers).
Colors in the animations below mean the following:
– not in cache,
– in cache,
– cache hit,
– just read from RAM,
– cache miss.
The 64x64 case:
Notice how almost every access to a new row results in a cache miss. And now how it looks for the normal case, a 65x65 matrix:
Here you can see that most of the accesses after the initial warming-up are cache hits. This is how CPU cache is intended to work in general.
The code that generated frames for the above animations can be seen here.
Luchian gives an explanation of why this behavior happens, but I thought it'd be a nice idea to show one possible solution to this problem and at the same time show a bit about cache oblivious algorithms.
Your algorithm basically does:
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
A[j][i] = A[i][j];
which is just horrible for a modern CPU. One solution is to know the details about your cache system and tweak the algorithm to avoid those problems. Works great as long as you know those details.. not especially portable.
Can we do better than that? Yes we can: a general approach to this problem is cache-oblivious algorithms, which, as the name says, avoid depending on specific cache sizes [1]
The solution would look like this:
void recursiveTranspose(int i0, int i1, int j0, int j1) {
int di = i1 - i0, dj = j1 - j0;
const int LEAFSIZE = 32; // well ok caching still affects this one here
if (di >= dj && di > LEAFSIZE) {
int im = (i0 + i1) / 2;
recursiveTranspose(i0, im, j0, j1);
recursiveTranspose(im, i1, j0, j1);
} else if (dj > LEAFSIZE) {
int jm = (j0 + j1) / 2;
recursiveTranspose(i0, i1, j0, jm);
recursiveTranspose(i0, i1, jm, j1);
} else {
for (int i = i0; i < i1; i++ )
for (int j = j0; j < j1; j++ )
mat[j][i] = mat[i][j];
}
}
Slightly more complex, but a short test shows something quite interesting on my ancient e8400 with VS2010 x64 release, testcode for MATSIZE 8192
int main() {
LARGE_INTEGER start, end, freq;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&start);
recursiveTranspose(0, MATSIZE, 0, MATSIZE);
QueryPerformanceCounter(&end);
printf("recursive: %.2fms\n", (end.QuadPart - start.QuadPart) / (double(freq.QuadPart) / 1000));
QueryPerformanceCounter(&start);
transpose();
QueryPerformanceCounter(&end);
printf("iterative: %.2fms\n", (end.QuadPart - start.QuadPart) / (double(freq.QuadPart) / 1000));
return 0;
}
results:
recursive: 480.58ms
iterative: 3678.46ms
Edit: About the influence of size: It is much less pronounced although still noticeable to some degree, that's because we're using the iterative solution as a leaf node instead of recursing down to 1 (the usual optimization for recursive algorithms). If we set LEAFSIZE = 1, the cache has no influence for me [8193: 1214.06 ms; 8192: 1171.62 ms; 8191: 1351.07 ms]. That's inside the margin of error; the fluctuations are in the 100 ms area, and this "benchmark" isn't something that I'd be too comfortable with if we wanted completely accurate values.
[1] Sources for this stuff: Well if you can't get a lecture from someone that worked with Leiserson and co on this.. I assume their papers are a good starting point. Those algorithms are still quite rarely described - CLR has a single footnote about them. Still it's a great way to surprise people.
Edit (note: I'm not the one who posted this answer; I just wanted to add this):
Here's a complete C++ version of the above code:
template<class InIt, class OutIt>
void transpose(InIt const input, OutIt const output,
size_t const rows, size_t const columns,
size_t const r1 = 0, size_t const c1 = 0,
size_t r2 = ~(size_t) 0, size_t c2 = ~(size_t) 0,
size_t const leaf = 0x20)
{
if (!~c2) { c2 = columns - c1; }
if (!~r2) { r2 = rows - r1; }
size_t const di = r2 - r1, dj = c2 - c1;
if (di >= dj && di > leaf)
{
transpose(input, output, rows, columns, r1, c1, (r1 + r2) / 2, c2);
transpose(input, output, rows, columns, (r1 + r2) / 2, c1, r2, c2);
}
else if (dj > leaf)
{
transpose(input, output, rows, columns, r1, c1, r2, (c1 + c2) / 2);
transpose(input, output, rows, columns, r1, (c1 + c2) / 2, r2, c2);
}
else
{
for (ptrdiff_t i1 = (ptrdiff_t) r1, i2 = (ptrdiff_t) (i1 * columns);
i1 < (ptrdiff_t) r2; ++i1, i2 += (ptrdiff_t) columns)
{
for (ptrdiff_t j1 = (ptrdiff_t) c1, j2 = (ptrdiff_t) (j1 * rows);
j1 < (ptrdiff_t) c2; ++j1, j2 += (ptrdiff_t) rows)
{
output[j2 + i1] = input[i2 + j1];
}
}
}
}

Optimizing this code block

for (int i = 0; i < 5000; i++)
for (int j = 0; j < 5000; j++)
{
for (int ii = 0; ii < 20; ii++)
for (int jj = 0; jj < 20; jj++)
{
int num = matBigger[i+ii][j+jj];
// Extract range from this.
int low = num & 0xff;
int high = num >> 8;
if (low < matSmaller[ii][jj] && matSmaller[ii][jj] > high)
// match found
}
}
The machine is x86_64, with a 32 KB L1 cache and a 256 KB L2 cache.
Any pointers on how can I possibly optimize this code?
EDIT Some background to the original problem : Fastest way to Find a m x n submatrix in M X N matrix
First thing I'd try is to move the ii and jj loops outside the i and j loops. That way you're using the same elements of matSmaller for 25 million iterations of the i and j loops, meaning that you (or the compiler if you're lucky) can hoist the access to them outside those loops:
for (int ii = 0; ii < 20; ii++) {
    for (int jj = 0; jj < 20; jj++) {
        // Hoisted: the same matSmaller element serves the whole i/j sweep.
        int smaller = matSmaller[ii][jj];
        for (int i = 0; i < 5000; i++) {
            for (int j = 0; j < 5000; j++) {
                int num = matBigger[i+ii][j+jj];
                int low = num & 0xff;
                if (low < smaller && smaller > (num >> 8)) {
                    // match found
                }
            }
        }
    }
}
This might be faster (thanks to fewer accesses to the matSmaller array), or it might be slower (because I've changed the pattern of access to the matBigger array, and it's possible that I've made it less cache-friendly). A similar alternative would be to move the ii loop outside i and j and hoist matSmaller[ii], but leave the jj loop inside. The rule of thumb is that it's more cache-friendly to increment the last index of a multi-dimensional array in your inner loops than earlier indexes. So we're "happier" to modify jj and j than we are to modify ii and i.
Second thing I'd try - what's the type of matBigger? Looks like the values in it are only 16 bits, so try it both as int and as (u)int16_t. The former might be faster because aligned int access is fast. The latter might be faster because more of the array fits in cache at any one time.
There are some higher-level things you could consider with some early analysis of smaller: for example if it's 0 then you needn't examine matBigger for that value of ii and jj, because num & 0xff < 0 is always false.
To do better than "guess things and see whether they're faster or not" you need to know for starters which line is hottest, which means you need a profiler.
Some basic advice:
Profile it, so you can learn where the hot-spots are.
Think about cache locality, and the addresses resulting from your loop order.
Use more const in the innermost scope, to hint more to the compiler.
Try breaking it up so you don't compute high if the low test is failing.
Try maintaining the offset into matBigger and matSmaller explicitly, to the innermost stepping into a simple increment.
The best thing is to understand what the code is supposed to do, then check whether another algorithm exists for this problem.
Apart from that:
if you are just interested in whether a matching entry exists, make sure to break out of all the nested loops at the position of // match found.
make sure the data is stored in an optimal way. It all depends on your problem, but e.g. it could be more efficient to have just one array of size 5000*5000*20 and overload operator()(int,int,int) for accessing elements.
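Breaking out of several nested loops at once is easiest when the search lives in its own function, so that a plain return does the job. Here is a minimal sketch; the `Matrix` alias and the name `anyMatch` are assumptions, and the test inside is copied from the question's condition as written.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;  // hypothetical storage

// Returns true as soon as any cell satisfies the question's test.
// Assumes both matrices are non-empty and bigger is at least as large
// as smaller in both dimensions.
bool anyMatch(const Matrix& bigger, const Matrix& smaller) {
    const std::size_t sr = smaller.size(), sc = smaller[0].size();
    const std::size_t rows = bigger.size() - sr + 1;
    const std::size_t cols = bigger[0].size() - sc + 1;
    for (std::size_t i = 0; i < rows; i++)
        for (std::size_t j = 0; j < cols; j++)
            for (std::size_t ii = 0; ii < sr; ii++)
                for (std::size_t jj = 0; jj < sc; jj++) {
                    const int num = bigger[i + ii][j + jj];
                    const int low = num & 0xff;
                    const int high = num >> 8;
                    const int s = smaller[ii][jj];
                    if (low < s && s > high)
                        return true;  // leaves all four loops at once
                }
    return false;
}
```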
What are matSmaller and matBigger?
Try changing the accesses to matBigger[(i+ii) * COL_COUNT + (j+jj)]
I agree with Steve about rearranging your loops to have the higher count as the inner loop. Since your code is only doing loads and compares, I believe a significant portion of the time is used for pointer arithmetic. Try an experiment to change Steve's answer into this:
for (int ii = 0; ii < 20; ii++)
{
    for (int jj = 0; jj < 20; jj++)
    {
        int smaller = matSmaller[ii][jj];
        for (int i = 0; i < 5000; i++)
        {
            // Walk the row with a pointer instead of re-indexing each time.
            int *pI = &matBigger[i+ii][jj];
            for (int j = 0; j < 5000; j++)
            {
                int num = *pI++;
                int low = num & 0xff;
                if (low < smaller && smaller > (num >> 8)) {
                    // match found
                }
            } // for j
        } // for i
    } // for jj
} // for ii
Even in 64-bit mode, the C compiler doesn't necessarily do a great job of keeping everything in register. By changing the array access to be a simple pointer increment, you'll make the compiler's job easier to produce efficient code.
Edit: I just noticed @unwind suggested basically the same thing. Another issue to consider is the statistics of your comparison. Is the low or high comparison more probable? Arrange the conditional statement so that the less probable test is first.
Looks like there is a lot of repetition here. One optimization is to reduce the amount of duplicate effort. Using pen and paper, I'm showing the matBigger "i" index iterating as:
[0 + 0], [0 + 1], [0 + 2], ..., [0 + 19],
[1 + 0], [1 + 1], ..., [1 + 18], [1 + 19]
[2 + 0], ..., [2 + 17], [2 + 18], [2 + 19]
As you can see there are locations that are accessed many times.
Also, multiplying the iteration counts indicate that the inner content is accessed: 20 * 20 * 5000 * 5000, or 10000000000 (10E+9) times. That's a lot!
So rather than trying to speed up the execution of 10E9 instructions (such as execution (pipeline) cache or data cache optimization), try reducing the number of iterations.
The code is searching the matrix for a number that is within a range: larger than a minimal value and less than the maximum range value.
Based on this, try a different approach:
Find and remember all coordinates where the search value is greater than the low value. Let us call these anchor points.
For each anchor point, find the coordinates of the first value after the anchor point that is outside the range.
The objective is to reduce the number of duplicate accesses. Anchor points allow for a one pass scan and allow other decisions such as finding a range or determining an MxN matrix that contains the anchor value.
Another idea is to create new data structures containing the matBigger and matSmaller that are more optimized for searching.
For example, create a {value, coordinate list} entry for each unique value in matSmaller:
Value coordinate list
26 -> (2,3), (6,5), ..., (1007, 75)
31 -> (4,7), (2634, 5), ...
Now you can use this data structure to find values in matSmaller and immediately know their locations. So you could search matBigger for each unique value in this data structure. This again reduces the number of accesses to the matrices.
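A {value, coordinate list} structure like the one sketched in the table could be built with a `std::map`. This is only an illustration: the nested-vector matrix type and the name `buildValueIndex` are assumptions, and an `std::unordered_map` or a plain array indexed by value may well be faster in practice.

```cpp
#include <map>
#include <utility>
#include <vector>

using Coordinates = std::vector<std::pair<int, int>>;

// Maps every distinct value in mat to the (row, column) positions
// where it occurs, in row-major scan order.
std::map<int, Coordinates> buildValueIndex(
        const std::vector<std::vector<int>>& mat) {
    std::map<int, Coordinates> index;
    for (int r = 0; r < static_cast<int>(mat.size()); r++) {
        for (int c = 0; c < static_cast<int>(mat[r].size()); c++) {
            index[mat[r][c]].emplace_back(r, c);
        }
    }
    return index;
}
```

Looking up a value is then a single `index.find(value)` instead of a scan over the whole matrix.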