Why can adding padding make your loop faster? - c++

People told me that adding padding can improve performance because it uses the cache more effectively.
I don't understand how making the data bigger can make the code faster.
Can someone explain why?

Array padding
The array padding technique consists of increasing the size of the array dimensions in order to reduce conflict misses when accessing a cache memory.
This type of miss can occur when the number of accessed elements mapping to the same set is greater than the degree of associativity of the cache.
Padding changes the data layout and can be applied (1) between variables (Inter-Variable Padding) or (2) to a variable (Intra-Variable Padding):
1. Inter-Variable Padding
float x[LEN], padding[P], y[LEN];

float redsum() {
    float s = 0;
    for (int i = 0; i < LEN; i++)
        s = s + x[i] + y[i];
    return s;
}
If we have a direct-mapped cache and the elements x[i] and y[i] map to the same set, accesses to x will evict a block of y and vice versa, resulting in a high miss rate and low performance.
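As an illustration only (the cache parameters and the value of LEN below are assumptions, not part of the answer), the padding can be sized so that x and y are offset by at least one cache line and no longer collide:

// Hypothetical sketch: direct-mapped cache with 64-byte lines.
// If LEN * sizeof(float) is a multiple of the cache size, x[i] and y[i]
// map to the same set; P floats of padding shift y by one full line.
constexpr int LEN = 8192;               // assumed problem size
constexpr int P = 64 / sizeof(float);   // 16 floats = one 64-byte line
float x[LEN], padding[P], y[LEN];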
2. Intra-Variable Padding
float x[LEN][LEN+PAD], y[LEN][LEN];

void symmetrize() {
    for (int i = 0; i < LEN; i++) {
        for (int j = 0; j < LEN; j++)
            y[i][j] = 0.5 * (x[i][j] + x[j][i]);
    }
}
In this case, if the elements of a column map to only a small number of sets, the sequence of column accesses may cause conflict misses, so spatial locality is not exploited.
For example, suppose that during the first iteration of the outer loop, the block containing x[0][0] x[0][1] ... x[0][15] is evicted to store the block containing the element x[k][0]. Then, at the start of the second iteration, the reference to x[0][1] would cause a cache miss.
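As a hedged illustration (the concrete sizes are assumptions, not taken from the answer), a common heuristic is to pad the row length so that it is no longer a large power of two, spreading the column elements over more cache sets:

// Hypothetical sketch: with LEN a power of two (say 1024), consecutive
// elements of a column are exactly LEN floats apart and can all map to a
// handful of sets. Padding each row by one 64-byte cache line breaks that.
constexpr int LEN = 1024;               // assumed problem size
constexpr int PAD = 64 / sizeof(float); // 16 floats = one cache line
float x[LEN][LEN + PAD], y[LEN][LEN];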
This technical document analyses the performance of the Fast Fourier Transform (FFT) as a function of the size of the matrix used in the calculations:
https://www.intel.com/content/www/us/en/developer/articles/technical/fft-length-and-layout-advisor.html
References
Gabriel Rivera and Chau-Wen Tseng. Data transformations for eliminating conflict misses. PLDI 1998. DOI: https://doi.org/10.1145/277650.277661
Changwan Hong et al. Effective padding of multidimensional arrays to avoid cache conflict misses. PLDI 2016. DOI: https://doi.org/10.1145/2908080.2908123

I don't think it would matter in a simple loop.
Have a look at this answer: Does alignment really matter for performance in C++11?
The most interesting bit for you from that answer is probably that you could arrange your classes so that members used together are in one cache line and those used by different threads are not.
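A minimal sketch of that idea (the struct and its members are invented for illustration; 64 bytes is only a typical cache-line size):

// Fields read together in the hot loop share one cache line, while a
// counter written by a different thread is pushed onto its own line so the
// two threads don't falsely share a line.
struct Particle {
    double x, y, z;       // used together on the main thread
    double vx, vy, vz;

    alignas(64) long hits_from_worker;  // written by a worker thread
};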

Related

If I have two functions copyij() and copyji(), to copy a 2048x2048 integer array, which one is faster and why? [duplicate]

Why is iterating 2D array row major faster than column major?

Here is simple C++ code that compares iterating over a 2D array in row-major versus column-major order.
#include <iostream>
#include <ctime>
using namespace std;

const int d = 10000;
int** A = new int*[d];

int main(int argc, const char* argv[]) {
    for (int i = 0; i < d; ++i)
        A[i] = new int[d];

    clock_t ColMajor = clock();
    for (int b = 0; b < d; ++b)
        for (int a = 0; a < d; ++a)
            A[a][b]++;
    double col = static_cast<double>(clock() - ColMajor) / CLOCKS_PER_SEC;

    clock_t RowMajor = clock();
    for (int a = 0; a < d; ++a)
        for (int b = 0; b < d; ++b)
            A[a][b]++;
    double row = static_cast<double>(clock() - RowMajor) / CLOCKS_PER_SEC;

    cout << "Row Major : " << row;
    cout << "\nColumn Major : " << col;
    return 0;
}
Result for different values of d:
d = 10^3 :
Row Major : 0.002431
Column Major : 0.017186
d = 10^4 :
Row Major : 0.237995
Column Major : 2.04471
d = 10^5
Row Major : 53.9561
Column Major : 444.339
Now the question is: why is row-major iteration faster than column-major?
It obviously depends on the machine you're on but very generally speaking:
Your computer stores parts of your program's memory in a cache that has a much smaller latency than main memory (even when compensating for cache hit time).
C arrays are stored contiguously in row-major order. This means that if you ask for element x, then element x+1 is stored in main memory at a location directly following where x is stored.
It's typical for your computer cache to "pre-emptively" fill cache with memory addresses that haven't been used yet, but that are locally close to memory that your program has used already. Think of your computer as saying: "well, you wanted memory at address X so I am going to assume that you will shortly want memory at X+1, therefore I will pre-emptively grab that for you and place it in your cache".
When you enumerate your array in row-major order, you're enumerating it in the same order it is stored in memory, and your machine has already taken the liberty of pre-loading those addresses into the cache because it guessed you would want them; therefore you achieve a higher rate of cache hits. When you enumerate the array in a non-contiguous manner, your machine likely won't predict your access pattern, so it won't be able to pull memory addresses into the cache pre-emptively for you. You get fewer cache hits, and main memory has to be accessed more frequently, which is slower than your cache.
Also, this might be better suited for https://cs.stackexchange.com/ because the way your system cache behaves is implemented in hardware, and spatial locality questions seem better suited there.
Your array is actually a ragged array, so row major isn't entirely a factor.
You're seeing better performance when the inner loop walks along a row (varying the column index) because each row's memory is laid out linearly, reading it sequentially is easy for the prefetcher to predict, and you amortize the pointer dereference into the second dimension since it only has to be done once per row.
When the inner loop instead walks down a column (varying the row index), you incur a pointer dereference into the second dimension on every iteration. Aside from that intrinsic cost, it's bad for cache prediction.
If you want a true two-dimensional array, laid out in memory using row-major ordering, you would want...
int A[1000][1000];
This lays out the memory contiguously in row-major order, instead of one array of pointers to arrays (which are not laid out contiguously). Iterating over this array using row-major would still perform faster than iterating column-major because of spatial locality and cache prediction.
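If the array is too large for a fixed-size declaration, a heap-allocated but still contiguous layout gives the same benefit; this is a sketch under that assumption (replacing the int** allocation above), not code from the question:

#include <vector>

// One contiguous block of d*d ints, indexed by hand in row-major order.
std::vector<int> A(static_cast<size_t>(d) * d);

for (int a = 0; a < d; ++a)        // row-major traversal: unit stride
    for (int b = 0; b < d; ++b)
        ++A[static_cast<size_t>(a) * d + b];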
The short answer is CPU caches.
Scott Meyers explains it very clearly here

Use map instead of array in C++ to protect searching outside of array bounds?

I have a gridded rectangular file that I have read into an array. This gridded file contains data values and NODATA values; the data values make up a continuous odd shape inside of the array, with NODATA values filling in the rest to keep the gridded file rectangular. I perform operations on the data values and skip the NODATA values.
The operations I perform on the data values consist of examining the 8 surrounding neighbors (the current cell is the center of a 3x3 grid). I can handle when any of the eight neighbors are NODATA values, but when actual data values fall in the first or last row/column, I trigger an error by trying to access an array value that doesn't exist.
To get around this I have considered three options:
Add a new first and last row/column with NODATA values, and adjust my code accordingly - I can cycle through the internal 'original' array and handle the new NODATA values like the edges I'm already handling that don't fall in the first and last row/column.
I can write specific handling for the cells with data in the first and last row/column: modified for loops (loops that step through a restricted range) that only examine the neighbouring cells that actually exist. Since I still need 8 neighbouring values (NODATA/non-existent cells are given the same value as the central cell), I would have to copy values into a secondary 3x3 grid, though there may be a way to avoid that. This solution is annoying because I have to code up specialized routines for the four corner cells (4 different for loops) and for any other cell in the first or last row/column (another 4 different for loops), on top of the single for loop for non-edge cells.
Use a map, which based on my reading appears capable of storing the original array while letting me look up locations outside the array without triggering an error. In this case, I still have to give these non-existent cells a value (equal to the central cell), so I may or may not need a secondary 3x3 grid here as well; once again, there may be a way to avoid the secondary grid.
Solution 1 seems the simplest, solution 3 the most clever, and 2 the most annoying. Are there any solutions I'm missing? Or does one of these solutions deserve to be the clear winner?
My advice is to replace all read accesses to the array with a function call, e.g. arr[i][j] with getarr(i, j). That way, all your algorithmic code stays more or less unchanged and you can easily return NODATA for indices outside the bounds.
But I must admit that it is only my opinion.
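A minimal sketch of that idea, assuming the grid is held in a vector of vectors and that NODATA is some sentinel value (both are placeholders, not details from the question):

#include <vector>

const double NODATA = -9999.0;  // placeholder sentinel value

// Returns NODATA for any index outside the grid, so the 3x3 neighbour loop
// never has to special-case the borders.
double getarr(const std::vector<std::vector<double>>& arr, int i, int j) {
    if (i < 0 || j < 0 ||
        i >= static_cast<int>(arr.size()) ||
        j >= static_cast<int>(arr[i].size()))
        return NODATA;
    return arr[i][j];
}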
I've had to do this before and the fastest solution was to expand the region with NODATA values and iterate over the interior. This way the core loop is simple for the compiler to optimize.
If this is not a computational hot-spot in the code, I'd go with Serge's approach instead though.
To minimize rippling effects I used an array structure with explicit row/column strides, something like this:
#include <memory>
#include <vector>
using std::shared_ptr;
using std::vector;

class Grid {
private:
    shared_ptr<vector<double>> data;
    int origin;
    int xStride;
    int yStride;
public:
    Grid(int nx, int ny) :
        data( new vector<double>(nx*ny) ),
        origin(0),
        xStride(1),
        yStride(nx) {
    }
    Grid(int nx, int ny, int padx, int pady) :
        data( new vector<double>((nx+2*padx)*(ny+2*pady)) ),
        origin(padx + pady*(nx+2*padx)),   // linear index of logical (0,0)
        xStride(1),
        yStride(nx+2*padx) {
    }
    double& operator()(int x, int y) {
        return (*data)[origin + x*xStride + y*yStride];
    }
};
Now you can do
Grid g(5,5,1,1);
Grid g2(5,5);

// Initialise
for(int i=0; i<5; ++i) {
    for(int j=0; j<5; ++j) {
        g(i,j) = i+j;
    }
}

// Convolve (note we don't care about going outside the
// range, and our indices are unchanged between the two
// grids).
for(int i=0; i<5; ++i) {
    for(int j=0; j<5; ++j) {
        g2(i,j) = 0;
        g2(i,j) += g(i-1,j);
        g2(i,j) += g(i+1,j);
        g2(i,j) += g(i,j-1);
        g2(i,j) += g(i,j+1);
    }
}
Aside: This data structure is awesome for working with transposes, and sub-matrices. Each of those is just an adjustment of the offset and stride values.
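For instance, a transposed view could be exposed as a hypothetical extra member of the Grid class above (a sketch of the idea, not part of the original answer; it needs <utility> for std::swap):

// Hypothetical addition inside class Grid: a transposed view is the same
// data with the two strides swapped; no elements are copied.
Grid transposed() const {
    Grid t(*this);                     // shallow copy; shares the data vector
    std::swap(t.xStride, t.yStride);   // view (x,y) now maps to stored (y,x)
    return t;
}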
Solution 1 is the standard solution. It takes maximum advantage of modern computer architectures, where a few extra bytes of memory are no big deal and removing the special-case branches keeps branch prediction accurate. As you keep accessing memory in a predictable pattern (with fixed strides), the CPU prefetcher will successfully read ahead.
Solution 2 saves a small amount of memory, but the special handling of the edges incurs a real slowdown. Still, the large chunk in the middle benefits from the prefetcher.
Solution 3 is horrible. Map access is O(log N) instead of O(1), and in practice it can be 10-20 times slower. Maps have poor locality of reference; the CPU prefetcher will not kick in.
If simple means "easy to read", I'd recommend you declare a class with an overloaded [] operator. Use it like a regular array, but it will have bounds checking to handle NODATA.
If simple means "high performance" and you have a sparse grid with isolated DATA values, consider implementing linked lists to the DATA values and implementing optimal operators that go directly to the DATA values.
Option 1 wastes memory proportional to your overall rectangle size, option 3 (maps) is clumsy here, and option 2 is actually very easy to do:
T d[X][Y] = ...;

for (int x = 0; x < X; ++x)
    for (int y = 0; y < Y; ++y) // move over d[x][y] centres
    {
        // start with every neighbour equal to the central cell...
        T r[3][3] = { { d[x][y], d[x][y], d[x][y] },
                      { d[x][y], d[x][y], d[x][y] },
                      { d[x][y], d[x][y], d[x][y] } };
        // ...then overwrite with the neighbours that exist and hold data
        for (int i = std::max(0, x-1); i <= std::min(X-1, x+1); ++i)
            for (int j = std::max(0, y-1); j <= std::min(Y-1, y+1); ++j)
                if (d[i][j] != NoData)
                    r[i-x+1][j-y+1] = d[i][j];
        // use r for whatever...
    }
Note that I'm using signed int very deliberately so x-1 and y-1 don't become huge positive numbers (as they would with, say, size_t) and break the std::max logic... but you could express it differently if you had some reason to prefer size_t (e.g. x == 0 ? 0 : x - 1).

cache friendly C++ operation on matrix in C++?

My application does some operations on large matrices.
I recently came across the concept of the CPU cache, and the performance effect it can have, through this answer.
I would like to know which of the following algorithms is more cache friendly for my case.
Algorithm 1:
for(int i = 0; i < size; i++)
{
    for(int j = i + 1; j < size; j++)
    {
        c[i][j] -= K * c[j][j]; // K is a constant double variable
    } // c is a 2-dimensional array of double variables
}
Algorithm 2:
double *A = new double[size];
for(int n = 0; n < size; n++)
    A[n] = c[n][n];

for(int i = 0; i < size; i++)
{
    for(int j = i + 1; j < size; j++)
    {
        c[i][j] -= K * A[j];
    }
}
The size of my array is more than 1000x1000.
Benchmarking on my laptop shows Algorithm 2 is better than 1 for size 5000x5000.
Please note that I have multi-threaded my application so that each thread operates on a set of rows.
For example: For array of size 1000x1000.
thread1 -> row 0 to row 249
thread2 -> row 250 to row 499
thread3 -> row 500 to row 749
thread4 -> row 750 to row 999
If your benchmarks show a significant improvement for the second case, then it most likely is the better choice. But of course, to know what happens on "an average CPU", we'd have to measure a large number of CPUs that could be called average; there is no other way. And it really depends on the definition of "average CPU": are we talking about any x86 (AMD + Intel) CPU, or any random CPU found in anything from a watch to the latest super-fast creation in the x86 range?
The "copy the data in c[n][n]" method helps because it gets its own address, and doesn't get thrown out of the (L1) cache when the code walks its way over the larger matrix [and all the data you need for the multiplication is "close together". If you walk c[j][j], every j steps will jump sizeof(double) * (size * j + 1) bytes per iteration, so if size is anything more than 4, the next item needed wont be in the same cache-line, so another memory read is needed to get that data.
In other words, for anything that has a decently sized cache (bigger than size * sizeof(double)), it's a definite benefit. Even with a smaller cache there is quite likely some benefit, but the chances are higher that the cached copy will be thrown out by some part of c[i][j].
In summary, the second algorithm is very likely better for nearly all options.
Algorithm 2 benefits from what's called "spatial locality": moving the diagonal into a one-dimensional array makes it reside in consecutive memory addresses, and thereby:
Enjoys the benefit of fetching multiple useful elements per cache line (presumably 64 bytes, depending on your CPU), making better use of the cache and of memory bandwidth (whereas reading c[n][n] directly would also fetch a lot of useless data that happens to share those lines).
Enjoys the benefit of the HW stream prefetchers (assuming your CPU has them), which aggressively run ahead of your code along the page and bring the data into the lower cache levels in advance, improving the effective memory latency.
It should be pointed out that moving the data to A doesn't necessarily improve cacheability, since A would still compete against a lot of data constantly coming in from c and thrashing the cache. However, since it's used over and over, there's a high chance that a good LRU algorithm would keep it in the cache anyway. You could help that by using streaming (non-temporal) memory operations for array c. Note that these are very volatile performance tools and may in some scenarios lead to a performance reduction if not used correctly.
Another potential benefit could come from mixing SW prefetches slightly ahead of reaching every new array line.
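A hedged sketch of that last idea applied to Algorithm 2, using the GCC/Clang __builtin_prefetch intrinsic (the prefetch distance of one row ahead is a guess to be tuned, not a measured value):

for (int i = 0; i < size; i++)
{
    // Ask for the start of the next row of c while we are still working on
    // row i (GCC/Clang builtin; other compilers need a different intrinsic).
    if (i + 1 < size)
        __builtin_prefetch(&c[i + 1][i + 2], 0, 3); // read, high temporal locality

    for (int j = i + 1; j < size; j++)
    {
        c[i][j] -= K * A[j];
    }
}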

Where is the bottleneck in this code?

I have the following tight loop that makes up the serial bottle neck of my code. Ideally I would parallelize the function that calls this but that is not possible.
// n is about 60
for (int k = 0; k < n; k++)
{
    double fone = z[k*n+i+1];
    double fzer = z[k*n+i];
    z[k*n+i+1] = s*fzer + c*fone;
    z[k*n+i]   = c*fzer - s*fone;
}
Are there any optimizations that can be made such as vectorization or some evil inline that can help this code?
I am looking into finding eigen solutions of tridiagonal matrices. http://www.cimat.mx/~posada/OptDoglegGraph/DocLogisticDogleg/projects/adjustedrecipes/tqli.cpp.html
Short answer: Change the memory layout of your matrix from row-major order to column-major order.
Long answer:
It seems you are accessing the (i)th and (i+1)th columns of a matrix stored in row-major order - probably a big matrix that doesn't fit into the CPU cache as a whole. Basically, on every loop iteration the CPU has to wait for RAM (on the order of a hundred cycles). After a few iterations, in theory, address prediction should kick in and the CPU should speculatively load the data items even before the loop accesses them. That should help with RAM latency. But that still leaves the problem that the code uses the memory bus inefficiently: the CPU and memory never exchange single bytes, only cache lines (64 bytes on current processors). Of every 64-byte cache line loaded and stored, your code only touches 16 bytes (a quarter).
Transposing the matrix and accessing it in native major order would increase memory bus utilization four-fold. Since that is probably the bottle-neck of your code, you can expect a speedup of about the same order.
Whether it is worth it, depends on the rest of your algorithm. Other parts may of course suffer because of the changed memory layout.
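A sketch of what the loop could look like if the matrix were stored transposed (zt is a hypothetical name; zt[i*n + k] holds what z[k*n + i] holds now), so the inner loop touches consecutive elements:

// With the transposed layout, columns i and i+1 of the original matrix are
// two contiguous runs of n doubles, so every cache line loaded is fully
// used and the hardware prefetcher can stream both runs.
for (int k = 0; k < n; k++)
{
    double fone = zt[(i+1)*n + k];
    double fzer = zt[i*n + k];
    zt[(i+1)*n + k] = s*fzer + c*fone;
    zt[i*n + k]     = c*fzer - s*fone;
}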
I take it you are rotating something (or rather, lots of things, by the same angle (s being a sin, c being a cos))?
Counting backwards is always good fun; it cuts out the comparison against a separate limit variable on each iteration and should work here. Making the counter double as the index might save a bit of time as well (it cuts out a bit of arithmetic, as others have said).
// assumes 0 <= i < n, so the loop stops once k drops below zero
for (int k = (n-1)*n + i; k >= 0; k -= n)
{
    double fone = z[k+1];
    double fzer = z[k];
    z[k+1] = s*fzer + c*fone;
    z[k]   = c*fzer - s*fone;
}
Nothing dramatic here, but it looks tidier if nothing else.
As a first move, I'd cache the pointer in this loop:
//n is about 60
double *cur_z = &z[0*n + i];
for (int k = 0; k < n; k++)
{
    double fone = *(cur_z + 1);
    double fzer = *cur_z;
    *(cur_z + 1) = s*fzer + c*fone;
    *cur_z       = c*fzer - s*fone;
    cur_z += n;
}
Second, I think it's better to make a templatized version of this function. As a result, you can get a good performance benefit if your matrix holds integer values (since FPU operations are slower).
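A minimal sketch of that suggestion (the function name and signature are invented for illustration):

// The same rotation step as above, parameterised on the element type; T
// could be double, float, or an integer type.
template <typename T>
void rotate_columns(T* z, int n, int i, T s, T c)
{
    T* cur_z = z + i;                  // row 0, column i
    for (int k = 0; k < n; ++k)
    {
        T fone = cur_z[1];
        T fzer = cur_z[0];
        cur_z[1] = s*fzer + c*fone;
        cur_z[0] = c*fzer - s*fone;
        cur_z += n;                    // advance one row
    }
}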