Access efficiency of C++ 2D array - c++

I have a 2D array a1[10000][100] with 10000 rows and 100 columns, and also a 2D array a2[100][10000] which is the transposed matrix of a1.
Now I need to access 2 columns (e.g. the 21st and the 71st columns) of a1 in the order a1[0][20], a1[0][70], a1[1][20], a1[1][70], ..., a1[9999][20], a1[9999][70]. Or I can access a2 to achieve the same goal (in the order a2[20][0], a2[70][0], a2[20][1], a2[70][1], ..., a2[20][9999], a2[70][9999]). The latter is much faster than the former. The relevant code is simplified as follows (size1 = 10000):
1  sum1 = 0;
2  for (i = 0; i < size1; ++i) {
3      x = a1[i][20];
4      y = a1[i][70];
5      sum1 += x + y;
6  } // this loop is slower
7  a_sum1[i] = sum1;
8
9  sum2 = 0;
10 for (i = 0; i < size1; ++i) {
11     x = a2[20][i];
12     y = a2[70][i];
13     sum2 += x + y;
14 } // this loop is faster
15 a_sum2[i] = sum2;
Accessing more rows of a2 (I have also tried 3 or 4 rows rather than the 2 in the example above) is likewise faster than accessing the same number of columns of a1. Of course I can also replace Lines 3-5 (or Lines 11-13) with a loop (using an extra array to store the column/row indexes to be accessed); the result is the same, the latter is faster than the former.
Why is the latter much faster than the former? I know something about cache lines but I have no idea of the reason for this case. Thanks.

You can benefit from the memory cache if you access addresses within the same cache line in a short amount of time. The explanation below assumes your arrays contain 4-byte integers.
In your first loop, your two memory accesses in the loop are 50*4 bytes apart, and the next iteration jumps forward 400 bytes. Every memory access here is a cache miss.
In the second loop, you still have two memory accesses per iteration that are 50 * 40,000 bytes apart (the rows of a2 are 10000 ints long), but on the next loop iteration you access addresses that are right next to the previously fetched values. Assuming the common 64-byte cache line size, you only have two cache misses every 16 iterations of the loop; the rest can be served from the two cache lines loaded at the start of such a cycle.

This is because C++ uses row-major order (https://en.wikipedia.org/wiki/Row-_and_column-major_order). You should avoid column-major access patterns in C/C++ (https://www.appentra.com/knowledge/checks/pwr010/).
The reason is that the elements are stored row by row, and accessing them by rows makes better use of cache lines, vectorization, and other hardware features/techniques.

The reason is cache locality.
a2[20][0], a2[20][1], a2[20][2] ... are stored in memory next to each other. And a1[0][20], a1[1][20], a1[2][20] ... aren't (the same applies to a2[70][0], a2[70][1], a2[70][2] ...).
That means that accessing a1[0][20], a1[1][20], a1[2][20] would waste DRAM bandwidth, as it would use only 4 or 8 bytes of each 64-byte cache line loaded from DRAM.

Related

Why can adding padding make your loop faster?

People told me that adding padding can give better performance because it uses the cache in a better way.
I don't understand how making your data bigger can get you better performance.
Can someone explain why?
Array padding
The array padding technique consists of increasing the size of the array dimensions in order to reduce conflict misses when accessing a cache memory.
This type of miss can occur when the number of accessed elements mapping to the same set is greater than the degree of associativity of the cache.
Padding changes the data layout and can be applied (1) between variables (Inter-Variable Padding) or (2) within a variable (Intra-Variable Padding):
1. Inter-Variable Padding
float x[LEN], padding[P], y[LEN];

float redsum() {
    float s = 0;
    for (int i = 0; i < LEN; i++)
        s = s + x[i] + y[i];
    return s;
}
If we have a direct mapped cache and the elements x[i] and y[i] are mapped into the same set, accesses to x will evict a block from y and vice versa, resulting in a high miss rate and low performance.
2. Intra-Variable Padding
float x[LEN][LEN+PAD], y[LEN][LEN];

void symmetrize() {
    for (int i = 0; i < LEN; i++) {
        for (int j = 0; j < LEN; j++)
            y[i][j] = 0.5 * (x[i][j] + x[j][i]);
    }
}
In this case, if the elements of a column are mapped into a small number of sets, their sequence of accesses may lead to conflict misses, so that the spatial locality would not be exploited.
For example, suppose that during the first iteration of the outer loop, the block containing x[0][0] x[0][1] ... x[0][15] is evicted to store the block containing the element x[k][0]. Then, at the start of the second iteration, the reference to x[0][1] would cause a cache miss.
This technical document analyses the performance of the Fast Fourier Transform (FFT) as a function of the size of the matrix used in the calculations:
https://www.intel.com/content/www/us/en/developer/articles/technical/fft-length-and-layout-advisor.html
References
Gabriel Rivera and Chau-Wen Tseng. Data transformations for eliminating conflict misses. PLDI 1998. DOI: https://doi.org/10.1145/277650.277661
Changwan Hong et al. Effective padding of multidimensional arrays to avoid cache conflict misses. PLDI 2016. DOI: https://doi.org/10.1145/2908080.2908123
I don't think it would matter in a simple loop.
Have a look at this answer: Does alignment really matter for performance in C++11?
The most interesting bit for you from that answer is probably that you could arrange your classes so that members used together are in one cache line and those used by different threads are not.

Efficiently count number of distinct values in 16-byte buffer in arm neon

Here's the basic algorithm to count number of distinct values in a buffer:
unsigned getCount(const uint8_t data[16])
{
    uint8_t pop[256] = { 0 };
    unsigned count = 0;
    for (int i = 0; i < 16; ++i)
    {
        uint8_t b = data[i];
        if (0 == pop[b])
            count++;
        pop[b]++;
    }
    return count;
}
Can this be done efficiently in NEON somehow, by loading the data into a q-register and doing some bit magic? Alternatively, can I efficiently tell that data has all elements identical, or contains only two distinct values, or more than two?
For example, using vminv_u8 and vmaxv_u8 I can find the min and max elements, and if they are equal I know that data has all identical elements. If not, then I can vceq_u8 with the min value and vceq_u8 with the max value, then vorr_u8 these results and check whether the result is all 1s. Basically, that is how it can be done in NEON. Any ideas how to make it better?
unsigned getCountNeon(const uint8_t data[16])
{
    uint8x16_t s = vld1q_u8(data);
    uint8x16_t smin = vdupq_n_u8(vminvq_u8(s));
    uint8x16_t smax = vdupq_n_u8(vmaxvq_u8(s));
    uint8x16_t res = vdupq_n_u8(1);
    uint8x16_t one = vdupq_n_u8(1);
    for (int i = 0; i < 14; ++i) // this obviously needs to be unrolled
    {
        s = vbslq_u8(vceqq_u8(s, smax), smin, s); // replace max with min
        uint8x16_t smax1 = vdupq_n_u8(vmaxvq_u8(s));
        res = vaddq_u8(res, vaddq_u8(vceqq_u8(smax1, smax), one));
        smax = smax1;
    }
    res = vaddq_u8(res, vaddq_u8(vceqq_u8(smax, smin), one));
    return vgetq_lane_u8(res, 0);
}
With some optimizations and improvements, a 16-byte block can perhaps be processed in 32-48 NEON instructions. Can this be done better in ARM? Unlikely.
Some background on why I ask this question: as I'm working on an algorithm, I'm trying different approaches to processing the data, and I'm not sure yet what exactly I'll use in the end. Information that might be of use:
count of distinct elements per 16-byte block
value that repeats most per 16-byte block
average per block
median per block
speed of light?.. that's a joke, it cannot be computed in neon from 16-byte block :)
so, I'm trying stuff, and before I use any approach I want to see if that approach can be well optimized. For example, average per block will be memcpy speed on arm64 basically.
If you're expecting a lot of duplicates, and can efficiently get a horizontal min with vminv_u8, this might be better than scalar. Or not, maybe NEON->ARM stalls for the loop condition kill it. >.< But it should be possible to mitigate that with unrolling (and saving some info in registers to figure out how far you overshot).
// pseudo-code because I'm too lazy to look up ARM SIMD intrinsics, edit welcome
// But I *think* ARM can do these things efficiently,
// except perhaps the loop condition. High latency could be ok, but stalling isn't
int count_dups(uint8x16_t v)
{
    int dups = (0xFF == vmax_u8(v)); // start with count=1 if any elements are 0xFF
    auto hmin = vmin_u8(v);
    while (hmin != 0xff) {
        auto min_bcast = vdup(hmin);          // broadcast the minimum
        auto matches = cmpeq(v, min_bcast);
        v |= matches;                         // the min and its dups become 0xFF
        hmin = vmin_u8(v);
        dups++;
    }
    return dups;
}
This turns unique values into 0xFF, one set of duplicates at a time.
The loop-carried dep chain through v / hmin stays in vector registers; it's only the loop branch that needs NEON->integer.
Minimizing / hiding NEON->integer/ARM penalties
Unroll by 8 with no branches on hmin, leaving results in 8 NEON registers. Then transfer those 8 values; back-to-back transfers of multiple NEON registers to ARM incur only one total stall (of 14 cycles on whatever hardware Jake tested). Out-of-order execution could also hide some of the penalty for this stall. Then check those 8 integer registers with a fully-unrolled integer loop.
Tune the unroll factor to be large enough that you usually don't need another round of SIMD operations for most input vectors. If almost all of your vectors have at most 5 unique values, then unroll by 5 instead of 8.
Instead of transferring multiple hmin results to integer, count them in NEON. If you can use ARM32 NEON partial-register tricks to put multiple hmin values in the same vector for free, it's only a bit more work to shuffle 8 of them into one vector and compare for not-equal to 0xFF. Then horizontally add that compare result to get a -count.
Or if you have values from different input vectors in different elements of a single vector, you can use vertical operations to add results for multiple input vectors at once without needing horizontal ops.
There's almost certainly room to optimize this, but I don't know ARM that well, or ARM performance details. NEON's hard to use for anything conditional because of the big performance penalty for NEON->integer, totally unlike x86. Glibc has a NEON memchr with NEON->integer in the loop, but I don't know if it uses it or if it's faster than scalar.
Speeding up repeated calls to the scalar ARM version:
Zeroing the 256-byte buffer every time would be expensive, but we don't need to do that. Use a sequence number to avoid needing to reset:
Before every new set of elements: ++seq;
For each element in the set:
    sum += (histogram[i] == seq);
    histogram[i] = seq; // no data dependency on the load result, unlike ++
You might make the histogram an array of uint16_t or uint32_t to avoid needing to re-zero if a uint8_t seq wraps. But then it takes more cache footprint, so maybe just re-zeroing every 254 sequence numbers makes the most sense.

cache friendly C++ operation on matrix in C++?

My application does some operations on matrices of large size.
I recently came across the concept of caches, and the performance effect they can have, through this answer.
I would like to know what would be the best algorithm which is cache friendly for my case.
Algorithm 1:
for (int i = 0; i < size; i++)
{
    for (int j = i + 1; j < size; j++)
    {
        c[i][j] -= K * c[j][j]; // K is a constant double; c is a 2-D array of double
    }
}
Algorithm 2:
double *A = new double[size];
for (int n = 0; n < size; n++)
    A[n] = c[n][n];
for (int i = 0; i < size; i++)
{
    for (int j = i + 1; j < size; j++)
    {
        c[i][j] -= K * A[j];
    }
}
The size of my array is more than 1000x1000.
Benchmarking on my laptop shows Algorithm 2 is better than 1 for size 5000x5000.
Please note that I have multi threaded my application such that a set of rows is operated by a thread.
For example: For array of size 1000x1000.
thread1 -> row 0 to row 249
thread2 -> row 250 to row 499
thread3 -> row 500 to row 749
thread4 -> row 750 to row 999
If your benchmarks show a significant improvement for the second case, then it most likely is the better choice. But of course, to know this for "an average CPU", we'd have to measure a large number of CPUs that could be called average; there is no other way. And it really depends on the definition of "average CPU": are we talking about any x86 (AMD + Intel) CPU, or any random CPU we can find in anything from a watch to the latest super-fast creation in the x86 range?
The "copy the data in c[n][n]" method helps because the copy gets its own addresses and doesn't get thrown out of the (L1) cache as the code walks over the larger matrix, and all the data you need for the multiplication is close together. If you walk c[j][j], each step of j jumps sizeof(double) * (size + 1) bytes, so once size is more than 7 the next item needed won't be in the same cache line, and another memory read is needed to get that data.
In other words, for anything with a decent-sized cache (bigger than size * sizeof(double)), it's a definite benefit. Even with a smaller cache there is quite likely some benefit, but the chances are higher that the cached copy will be thrown out by some part of c[i][j].
In summary, the second algorithm is very likely better for nearly all options.
Algorithm 2 benefits from what's called "spatial locality": moving the diagonal into a one-dimensional array makes it reside at consecutive memory addresses, and thereby it:
Enjoys the benefit of fetching multiple useful elements per cache line (presumably 64 bytes, depending on your CPU), making better use of the cache and memory bandwidth (whereas reading c[j][j] directly also fetches a lot of useless data sharing the same lines).
Enjoys the benefit of hardware stream prefetchers (if your CPU has them), which aggressively run ahead of your code along the page and bring the data into the lower cache levels in advance, improving memory latency.
It should be pointed out that moving the data into A doesn't necessarily improve cacheability, since A still competes against a lot of data constantly streaming in from c and thrashing the cache. However, since A is used over and over, there's a high chance that a good LRU policy will keep it in the cache anyway. You could help that by using streaming (non-temporal) memory operations for array c; note that these are very volatile performance tools, and in some scenarios they reduce performance if not used correctly.
Another potential benefit could come from mixing SW prefetches slightly ahead of reaching every new array line.

Fast merge of sorted subsets of 4K floating-point numbers in L1/L2

What is a fast way to merge sorted subsets of an array of up to 4096 32-bit floating point numbers on a modern (SSE2+) x86 processor?
Please assume the following:
The size of the entire set is at maximum 4096 items
The size of the subsets is open to discussion, but let us assume between 16-256 initially
All data used through the merge should preferably fit into L1
The L1 data cache size is 32K. 16K has already been used for the data itself, so you have 16K to play with
All data is already in L1 (with as high degree of confidence as possible) - it has just been operated on by a sort
All data is 16-byte aligned
We want to try to minimize branching (for obvious reasons)
Main criteria of feasibility: faster than an in-L1 LSD radix sort.
I'd be very interested to see if someone knows of a reasonable way to do this given the above parameters! :)
Here's a very naive way to do it. (Please excuse any 4am delirium-induced pseudo-code bugs ;)
// 4x sorted subsets
data[4][4] = {
    {3, 4, 5, INF},
    {2, 7, 8, INF},
    {1, 4, 4, INF},
    {5, 8, 9, INF}
}
data_offset[4] = {0, 0, 0, 0}

n = 4 * 3
for (i = 0; i < n; i++):
    sub = 0
    sub = 1 * (data[sub][data_offset[sub]] > data[1][data_offset[1]])
    sub = 2 * (data[sub][data_offset[sub]] > data[2][data_offset[2]])
    sub = 3 * (data[sub][data_offset[sub]] > data[3][data_offset[3]])
    out[i] = data[sub][data_offset[sub]]
    data_offset[sub]++
Edit:
With AVX2 and its gather support, we could compare up to 8 subsets at once.
Edit 2:
Depending on type casting, it might be possible to shave off 3 extra clock cycles per iteration on a Nehalem (mul: 5, shift+sub: 4)
//Assuming 'sub' is uint32_t
sub = ... << ((data[sub][data_offset[sub]] > data[...][data_offset[...]]) - 1)
Edit 3:
It may be possible to exploit out-of-order execution to some degree, especially as K gets larger, by using two or more max values:
max1 = 0
max2 = 1
max1 = 2 * (data[max1][data_offset[max1]] > data[2][data_offset[2]])
max2 = 3 * (data[max2][data_offset[max2]] > data[3][data_offset[3]])
...
max1 = 6 * (data[max1][data_offset[max1]] > data[6][data_offset[6]])
max2 = 7 * (data[max2][data_offset[max2]] > data[7][data_offset[7]])
q = data[max1][data_offset[max1]] < data[max2][data_offset[max2]]
sub = max1*q + ((~max2)&1)*q
Edit 4:
Depending on compiler intelligence, we can remove multiplications altogether using the ternary operator:
sub = (data[sub][data_offset[sub]] > data[x][data_offset[x]]) ? x : sub
Edit 5:
In order to avoid costly floating point comparisons, we could simply reinterpret_cast<uint32_t*>() the data, as this would result in an integer compare.
Another possibility is to utilize SSE registers as these are not typed, and explicitly use integer comparison instructions.
This works due to the operators < > == yielding the same results when interpreting a float on the binary level.
Edit 6:
If we unroll our loop sufficiently to match the number of values to the number of SSE registers, we could stage the data that is being compared.
At the end of an iteration we would then re-transfer the register which contained the selected maximum/minimum value, and shift it.
Although this requires reworking the indexing slightly, it may prove more efficient than littering the loop with LEA's.
This is more of a research topic, but I did find this paper which discusses minimizing branch mispredictions using d-way merge sort.
SIMD sorting algorithms have already been studied in detail. The paper Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture describes an efficient algorithm for doing what you describe (and much more).
The core idea is that you can reduce merging two arbitrarily long lists to merging blocks of k consecutive values (where k can range from 4 to 16): the first block is z[0] = merge(x[0], y[0]).lo. To obtain the second block, we know that the leftover merge(x[0], y[0]).hi contains nx elements from x and ny elements from y, with nx+ny == k. But z[1] cannot contain elements from both x[1] and y[1], because that would require z[1] to contain more than nx+ny elements: so we just have to find out which of x[1] and y[1] needs to be added. The one with the lower first element will necessarily appear first in z, so this is simply done by comparing their first element. And we just repeat that until there is no more data to merge.
Pseudo-code, assuming the arrays end with a +inf value:
a := *x++
b := *y++
while not finished:
    lo, hi := merge(a, b)
    *z++ := lo
    a := hi
    if *x[0] <= *y[0]:
        b := *x++
    else:
        b := *y++
(note how similar this is to the usual scalar implementation of merging)
The conditional jump is of course not necessary in an actual implementation: for example, you could conditionally swap x and y with an xor trick, and then read unconditionally *x++.
merge itself can be implemented with a bitonic sort. But if k is low, there will be a lot of inter-instruction dependencies resulting in high latency. Depending on the number of arrays you have to merge, you can then choose k high enough so that the latency of merge is masked, or if this is possible interleave several two-way merges. See the paper for more details.
Edit: Below is a diagram when k = 4. All asymptotics assume that k is fixed.
The big gray box is merging two arrays of size n = m * k (in the picture, m = 3).
We operate on blocks of size k.
The "whole-block merge" box merges the two arrays block-by-block by comparing their first elements. This is a linear time operation, and it doesn't consume memory because we stream the data to the rest of the block. The performance doesn't really matter because the latency is going to be limited by the latency of the "merge4" blocks.
Each "merge4" box merges two blocks, outputs the lower k elements, and feeds the upper k elements to the next "merge4". Each "merge4" box performs a bounded number of operations, and the number of "merge4" is linear in n.
So the time cost of merging is linear in n. And because "merge4" has a lower latency than performing 8 serial non-SIMD comparisons, there will be a large speedup compared to non-SIMD merging.
Finally, to extend our 2-way merge to merge many arrays, we arrange the big gray boxes in classical divide-and-conquer fashion. Each level has complexity linear in the number of elements, so the total complexity is O(n log (n / n0)) with n0 the initial size of the sorted arrays and n is the size of the final array.
The most obvious answer that comes to mind is a standard N-way merge using a heap. That'll be O(N log k). The number of subsets is between 16 and 256, so the worst case behavior (with 256 subsets of 16 items each) would be 8N.
Cache behavior should be ... reasonable, although not perfect. The heap, where most of the action is, will probably remain in the cache throughout. The part of the output array being written to will also most likely be in the cache.
What you have is 16K of data (the array with sorted subsequences), the heap (1K, worst case), and the sorted output array (16K again), and you want it to fit into a 32K cache. Sounds like a problem, but perhaps it isn't. The data that will most likely be swapped out is the front of the output array after the insertion point has moved. Assuming that the sorted subsequences are fairly uniformly distributed, they should be accessed often enough to keep them in the cache.
You can merge int arrays (expensive) branch-free.
typedef unsigned uint;
typedef uint* uint_ptr;

void merge(uint* in1_begin, uint* in1_end, uint* in2_begin, uint* in2_end, uint* out) {
    uint_ptr in[]     = {in1_begin, in2_begin};
    uint_ptr in_end[] = {in1_end, in2_end};
    // the loop branch is cheap because it is easily predictable
    while (in[0] != in_end[0] && in[1] != in_end[1]) {
        // assumes the values fit in 31 bits, so the top bit of the
        // wrapped difference acts as a sign bit
        int i = (*in[1] - *in[0]) >> 31;
        *out = *in[i];
        ++out;
        ++in[i];
    }
    // copy the remaining stuff ...
}
Note that (*in[1] - *in[0]) >> 31 is equivalent to *in[1] - *in[0] < 0, which is equivalent to *in[1] < *in[0], so i selects whichever input currently has the smaller head element. The reason I wrote it using the bitshift trick instead of
int i = *in[1] < *in[0];
is that not all compilers generate branch-free code for the < version.
Unfortunately you are using floats instead of ints, which at first seems like a showstopper because I do not see how to reliably implement *in[1] < *in[0] branch-free. However, on most modern architectures you can interpret the bit patterns of positive floats (that are not NaNs, INFs or other such strange things) as ints, compare them using <, and still get the correct result. Perhaps you can extend this observation to arbitrary floats.
You could do a simple merge kernel to merge K lists:
float *input[K];
float *output;

while (true) {
    float min = *input[0];
    int min_idx = 0;
    for (int i = 1; i < K; i++) {
        float v = *input[i];
        if (v < min) {
            min = v;     // do with cmov
            min_idx = i; // do with cmov
        }
    }
    if (min == SENTINEL) break;
    *output++ = min;
    input[min_idx]++;
}
There's no heap, so it is pretty simple. The bad part is that it is O(NK), which can be bad if K is large (unlike the heap implementation which is O(N log K)). So then you just pick a maximum K (4 or 8 might be good, then you can unroll the inner loop), and do larger K by cascading merges (handle K=64 by doing 8-way merges of groups of lists, then an 8-way merge of the results).

Speed of C program execution

I got this problem on my exam for the subject Principles of Programming Languages. I thought about it for a long time, but I still did not understand the problem.
Problem:
Below is a C program, executed in the MSVC++ 6.0 environment on a PC with roughly this configuration: Intel 1.8 GHz CPU, 512 MB RAM.
#include <stdio.h>
#include <time.h>

#define M 10000
#define N 5000

int a[M][N];

void main() {
    int i, j;
    time_t start, stop;

    // Part A
    start = time(0);
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            a[i][j] = 0;
    stop = time(0);
    printf("%d\n", (int)(stop - start));

    // Part B
    start = time(0);
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            a[i][j] = 0;
    stop = time(0);
    printf("%d\n", (int)(stop - start));
}
Explain why part A executes in only 1 second, while part B takes 8 seconds to finish.
This has to do with how the array's memory is laid out and how it gets loaded into the cache and accessed: in version A, when accessing a cell of the array, the neighbors get loaded with it into the cache, and the code then immediately accesses those neighbors. In version B, one cell is accessed (and its neighbors loaded into the cache), but the next access is far away, on the next row, and so the whole cache line was loaded but only one value used, and another cache line must be filled for each access. Hence the speed difference.
Row-major order versus column-major order.
Recall first that all multi-dimensional arrays are represented in memory as a contiguous block of memory. Thus the multidimensional array A(m,n) might be represented in memory as
a00 a01 a02 ... a0n a10 a11 a12 ... a1n a20 ... amn
In the first loop, you run through this block of memory sequentially, traversing the elements in exactly the order they are stored:

a00 a01 a02 ... a0n a10 a11 a12 ... a1n a20 ... amn

that is, the whole of row 0 first, then the whole of row 1, and so on.

In the second loop, you skip around in memory and traverse the elements in the order

a00 a10 a20 ... am0 a01 a11 a21 ... am1 a02 ... amn

so consecutive visits are a whole row apart in memory: a00 is visited first, but its memory neighbour a01 is not reached until an entire column of other elements has been visited in between.
All that skipping around really hurts you because you don't gain advantages from caching. When you run through the array sequentially, neighboring elements are loaded into the cache. When you skip around through the array, you don't get these benefits and instead keep getting cache misses harming performance.
Because of hardware architectural optimizations. Part A is executing operations on sequential memory addresses, which allows the hardware to substantially accelerate how the calculations are handled. Part B is basically jumping around in memory all the time, which defeats a lot of hardware optimizations.
Key concept for this specific case is processor cache.
The array you are declaring is laid out row-wise in memory. Basically you have a large block of M×N integers, and C does a little trickery to make you believe that it's rectangular. But in reality it's flat.
So when you iterate through it row-wise (with i, the index over M, as the outer loop variable), you are really going linearly through memory, something the CPU cache handles very well.
However, when you iterate with j (the index over N) in the outer loop, you are always making more or less random jumps in memory (at least to the hardware it looks like that): you access one cell, then move N integers further and do the same, and so on. Since pages in memory are usually around 4 KiB, each iteration of the inner loop even touches a different page. That way nearly any caching strategy fails and you see a major slowdown.
The trouble here is how your array is laid out in memory.
In computer memory, arrays are normally allocated so that all the columns of the first row come first, then those of the second row, and so on.
Your computer memory is best viewed as a long strip of bytes: it is a one-dimensional array of memory, not two-dimensional, so multi-dimensional arrays have to be allocated in the way described.
Now comes a further problem: modern CPUs have caches. They have multiple levels of cache, and the first-level cache works in units of so-called "cache lines". What does this mean? Access to memory is fast, but not fast enough, since modern CPUs are much faster, so they have on-chip caches to speed things up. They also no longer access single memory locations, but fill one complete cache line per fetch, again for performance. This behaviour benefits every operation that processes data linearly. When you access all the columns in a row first, then the next row, and so on, you are in fact working linearly. When you instead process the first column of every row first, you "jump" around in memory: you constantly force new cache lines to be filled, only a few bytes of each can be used, and then the cache line is likely evicted by your next jump...
Thus column-major traversal of a row-major array is bad for modern processors, since it does not access memory linearly.