compiler nested loop optimization for sequential memory access. - c++

I came across a strange performance issue in a matrix multiply benchmark (matrix_mult in Metis from the MOSBENCH suite). The benchmark was optimized to tile the data such that the active working set was 12kb (3 tiles of 32x32 ints) and would fit into the L1 cache. To make a long story short, swapping the inner two most loops had a performance difference of almost 4x on certain array input sizes (4096, 8192) and about a 30% difference on others. The problem essentially came down to accessing elements sequentially versus in a stride pattern. Certain array sizes I think created a bad stride access that generated a lot cache line collisions. The performance difference is noticeably less when changing from 2-way associative L1 to an 8-way associative L1.
My question is why doesn't gcc optimize loop ordering to maximize sequential memory accesses?
Below is a simplified version of the problem (note that performance times are highly dependent on L1 configuration. The numbers indicated below are from 2.3 GHZ AMD system with 64K L1 2-way associative compiled with -O3).
N = ARRAY_SIZE // 1024
int* mat_A = (int*)malloc(N*N*sizeof(int));
int* mat_B = (int*)malloc(N*N*sizeof(int));
int* mat_C = (int*)malloc(N*N*sizeof(int));
// Elements of mat_B are accessed in a stride pattern of length N
// This takes 800 msec
for (int t = 0; t < 1000; t++)
for (int a = 0; a < 32; a++)
for (int b = 0; b < 32; b++)
for (int c = 0; c < 32; c++)
mat_C[N*a+b] += mat_A[N*a+c] * mat_B[N*c+b];
// Inner two loops are swapped
// Elements are now accessed sequentially in inner loop
// This takes 172 msec
for (int t = 0; t < 1000; t++)
for (int a = 0; a < 32; a++)
for (int c = 0; c < 32; c++)
for (int b = 0; b < 32; b++)
mat_C[N*a+b] += mat_A[N*a+c] * mat_B[N*c+b];

gcc might not be able to prove that the pointers don't overlap. If you are fine using non standard extensions you could try using __restrict.
gcc doesn't take full advantage of your architecture to avoid the necessity to recompile for every processor. Using the option -march with the appropriate value for your system might help.

gcc has a bunch of optimizations that just do what you want.
Look up the -floop-strip-mine and -floop-block compiler options.
Quote from the manual:
Perform loop blocking transformations on loops. Blocking strip mines
each loop in the loop nest such that the memory accesses of the
element loops fit inside caches. The strip length can be changed using
the loop-block-tile-size parameter.

Related

How to obtain performance enhancement while multiplying two sub-matrices?

I've got a program multiplying two sub-matrices residing in the same container matrix. I'm trying to obtain some performance gain by using the OpenMP API for parallelization. Below is the multiplication algorithm I use.
#pragma omp parallel for
for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
}
}
}
The algorithm accesses the elements of both input sub-matrices row-wise to enhance cache usage with the spatial locality.
What other OpenMP directives can be used to obtain better performance from that simple algorithm? Is there any other directive for optimizing the operations on the overlapping areas of two sub-matrices?
You can assume that all the sub-matrices have the same size and they are square-shaped. The resulting sub-matrix resides in another container matrix.
For the matrix-matrix product, any permutation of i,j,k indices computes the right result, sequentially. In parallel, not so. In your original code the k iterations do not write to unique locations, so you can not just collapse the outer two loops. Do a k,j interchange and then it is allowed.
Of course OpenMP gets you from 5 percent efficiency on one core to 5 percent on all cores. You really want to block the loops. But that is a lot harder. See the paper by Goto and van de Geijn.
I'm adding something related to main matrix. Do you use this code to multiply two bigger matrices? Then one of the sub-matrices are re-used between different iterations and likely to benefit from CPU cache. For example, if there are 4 sub-matrices of a matrix, then each sub-matrix is used twice, to get a value on result matrix.
To benefit from cache most, the re-used data should be kept in the cache of the same thread (core). To do this, maybe it is better to move the work-distribution level to the place where you select two submatrices.
So, something like this:
select sub-matrix A
#pragma omp parallel for
select sub-matrix B
for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
}
}
}
could work faster since whole data always stays in same thread (core).

What is the most efficient way to repeat elements in a vector and apply a set of different functions across all elements using Eigen?

Say I have a vector containing only positive, real elements defined like this:
Eigen::VectorXd v(1.3876, 8.6983, 5.438, 3.9865, 4.5673);
I want to generate a new vector v2 that has repeated the elements in v some k times. Then I want to apply k different functions to each of the repeated elements in the vector.
For example, if v2 was v repeated 2 times and I applied floor() and ceil() as my two functions, the result based on the above vector would be a column vector with values: [1; 2; 8; 9; 5; 6; 3; 4; 4; 5]. Preserving the order of the original values is important here as well. These values are also a simplified example, in practice, I'm generating vectors v with ~100,000 or more elements and would like to make my code as vectorizable as possible.
Since I'm coming to Eigen and C++ from Matlab, the simplest approach I first took was to just convert this Nx1 vector into an Nx2 matrix, apply floor to the first column and ceil to the second column, take the transpose to get a 2xN matrix and then exploit the column-major nature of the matrix and reshape the 2xN matrix into a 2Nx1 vector, yielding the result I want. However, for large vectors, this would be very slow and inefficient.
This response by ggael effectively addresses how I could repeat the elements in the input vector by generating a sequence of indices and indexing the input vector. I could just then generate more sequences of indices to apply my functions to the relevant elements v2 and copy the result back to their respective places. However, is this really the most efficient approach? I dont fully grasp copy-on-write and move semantics, but I think the second indexing expressions would be in a sense redundant?
If that is true, then my guess is that a solution here would be some sort of nullary or unary expression where I could define an expression that accepts the vector, some index k and k expressions/functions to apply to each element and spits out the vector I'm looking for. I've read the Eigen documentation on the subject, but I'm struggling to build a functional example. Any help would be appreciated!
So, if I understand you correctly, you don't want to replicate (in terms of Eigen methods) the vector, you want to apply different methods to the same elements and store the result for each, correct?
In this case, computing it sequentially once per function is the easiest route. Most CPUs can only do one (vector) memory store per clock cycle, anyway. So for simple unary or binary operations, your gains have an upper bound.
Still, you are correct that one load is technically always better than two and it is a limitation of Eigen that there is no good way of achieving this.
Know that even if you manually write a loop that would generate multiple outputs, you should limit yourself in the number of outputs. CPUs have a limited number of line-fill buffers. IIRC Intel recommended using less than 10 "output streams" in tight loops, otherwise you could stall the CPU on those.
Another aspect is that C++'s weak aliasing restrictions make it hard for compilers to vectorize code with multiple outputs. So it might even be detrimental.
How I would structure this code
Remember that Eigen is column-major, just like Matlab. Therefore use one column per output function. Or just use separate vectors to begin with.
Eigen::VectorXd v = ...;
Eigen::MatrixX2d out(v.size(), 2);
out.col(0) = v.array().floor();
out.col(1) = v.array().ceil();
Following the KISS principle, this is good enough. You will not gain much if anything by doing something more complicated. A bit of multithreading might gain you something (less than factor 2 I would guess) because a single CPU thread is not enough to max out memory bandwidth but that's about it.
Some benchmarking
This is my baseline:
int main()
{
int rows = 100013, repetitions = 100000;
Eigen::VectorXd v = Eigen::VectorXd::Random(rows);
Eigen::MatrixX2d out(rows, 2);
for(int i = 0; i < repetitions; ++i) {
out.col(0) = v.array().floor();
out.col(1) = v.array().ceil();
}
}
Compiled with gcc-11, -O3 -mavx2 -fno-math-errno I get ca. 5.7 seconds.
Inspecting the assembler code finds good vectorization.
Plain old C++ version:
double* outfloor = out.data();
double* outceil = outfloor + out.outerStride();
const double* inarr = v.data();
for(std::ptrdiff_t j = 0; j < rows; ++j) {
const double vj = inarr[j];
outfloor[j] = std::floor(vj);
outceil[j] = std::ceil(vj);
}
40 seconds instead of 5! This version actually does not vectorize because the compiler cannot prove that the arrays don't alias each other.
Next, let's use fixed size Eigen vectors to get the compiler to generate vectorized code:
double* outfloor = out.data();
double* outceil = outfloor + out.outerStride();
const double* inarr = v.data();
std::ptrdiff_t j;
for(j = 0; j + 4 <= rows; j += 4) {
const Eigen::Vector4d vj = Eigen::Vector4d::Map(inarr + j);
const auto floorval = vj.array().floor();
const auto ceilval = vj.array().ceil();
Eigen::Vector4d::Map(outfloor + j) = floorval;
Eigen::Vector4d::Map(outceil + j) = ceilval;;
}
if(j + 2 <= rows) {
const Eigen::Vector2d vj = Eigen::Vector2d::MapAligned(inarr + j);
const auto floorval = vj.array().floor();
const auto ceilval = vj.array().ceil();
Eigen::Vector2d::Map(outfloor + j) = floorval;
Eigen::Vector2d::Map(outceil + j) = ceilval;;
j += 2;
}
if(j < rows) {
const double vj = inarr[j];
outfloor[j] = std::floor(vj);
outceil[j] = std::ceil(vj);
}
7.5 seconds. The assembler looks fine, fully vectorized. I'm not sure why performance is lower. Maybe cache line aliasing?
Last attempt: We don't try to avoid re-reading the vector but we re-read it blockwise so that it will be in cache by the time we read it a second time.
const int blocksize = 64 * 1024 / sizeof(double);
std::ptrdiff_t j;
for(j = 0; j + blocksize <= rows; j += blocksize) {
const auto& vj = v.segment(j, blocksize);
auto outj = out.middleRows(j, blocksize);
outj.col(0) = vj.array().floor();
outj.col(1) = vj.array().ceil();
}
const auto& vj = v.tail(rows - j);
auto outj = out.bottomRows(rows - j);
outj.col(0) = vj.array().floor();
outj.col(1) = vj.array().ceil();
5.4 seconds. So there is some gain here but not nearly enough to justify the added complexity.

Why can adding padding make your loop faster?

People told me that adding padding can help to have better performance because it's using the cache in a better way.
I don't understand how is it possible that by making your data bigger you get better performance.
Can someone understand why?
Array padding
The array padding technique consists of increasing the size of the array dimensions in order to reduce conflict misses when accessing a cache memory.
This type of miss can occur when the number of accessed elements mapping to the same set is greater than the degree of associativity of the cache.
Padding changes the data layout and can be applied (1) between variables (Inter-Variable Padding) or (2) to a variable (Intra-Variable Padding):
1. Inter-Variable Padding
float x[LEN], padding[P], y[LEN];
float redsum() {
float s = 0;
for (int i = 0; i < LEN; i++)
s = s + x[i] + y[i];
return s;
}
If we have a direct mapped cache and the elements x[i] and y[i] are mapped into the same set, accesses to x will evict a block from y and vice versa, resulting in a high miss rate and low performance.
2. Intra-Variable Padding
float x[LEN][LEN+PAD], y[LEN][LEN];
void symmetrize() {
for (int i = 0; i < LEN; i++) {
for (int j = 0; j < LEN; j++)
y[i][j] = 0.5 *(x[i][j] + x[j][i]);
}
}
In this case, if the elements of a column are mapped into a small number of sets, their sequence of accesses may lead to conflict misses, so that the spatial locality would not be exploited.
For example, suppose that during the first iteration of the outer loop, the block containing x[0][0] x[0][1] ... x[0][15] is evicted to store the block containing the element x[k][0]. Then, at the start of the second iteration, the reference to x[0][1] would cause a cache miss.
This technical document analyses the performance of the Fast Fourier Transform (FFT) as a function of the size of the matrix used in the calculations:
https://www.intel.com/content/www/us/en/developer/articles/technical/fft-length-and-layout-advisor.html
References
Gabriel Rivera and Chau-Wen Tseng. Data transformations for eliminating conflict misses. PLDI 1998. DOI: https://doi.org/10.1145/277650.277661
Changwan Hong et al. Effective padding of multidimensional arrays to avoid cache conflict misses. PLDI 2016. DOI: https://doi.org/10.1145/2908080.2908123
I don't think it would matter in a simple loop.
Have a look at this answer: Does alignment really matter for performance in C++11?
The most interesting bit for you from that answer is probably that you could arrange your classes so that members used together are in one cache line and those used by different threads are not.

cache friendly C++ operation on matrix in C++?

My application does some operations on matrices of large size.
I recently came accross the concept of cache & the performance effect it can have through this answer.
I would like to know what would be the best algorithm which is cache friendly for my case.
Algorithm 1:
for(int i = 0; i < size; i++)
{
for(int j = i + 1; j < size; j++)
{
c[i][j] -= K * c[j][j];//K is a constant double variable
}//c is a 2 dimensional array of double variables
}
Algorithm 2:
double *A = new double[size];
for(int n = 0; n < size; n++)
A[n] = c[n][n];
for(int i = 0; i < size; i++)
{
for(int j = i + 1; j < size; j++)
{
c[i][j] -= K * A[j];
}
}
The size of my array is more than 1000x1000.
Benchmarking on my laptop shows Algorithm 2 is better than 1 for size 5000x5000.
Please note that I have multi threaded my application such that a set of rows is operated by a thread.
For example: For array of size 1000x1000.
thread1 -> row 0 to row 249
thread2 -> row 250 to row 499
thread3 -> row 500 to row 749
thread4 -> row 750 to row 999
If your benchmarks show significant improvement for the second case, then it most likely is the better choice. But of course, to know for "an average CPU", we'd have to know that for a large number of CPU's that can be called average - there is no other way. And it will really depend on the definition of Average CPU. Are we talking "any x86 (AMD + Intel) CPU" or "Any random CPU that we can find in anything from a watch to the latest super-fast creation in the x86 range"?
The "copy the data in c[n][n]" method helps because it gets its own address, and doesn't get thrown out of the (L1) cache when the code walks its way over the larger matrix [and all the data you need for the multiplication is "close together". If you walk c[j][j], every j steps will jump sizeof(double) * (size * j + 1) bytes per iteration, so if size is anything more than 4, the next item needed wont be in the same cache-line, so another memory read is needed to get that data.
In other words, for anything that has a decent size cache (bigger than size * sizeof(double)), it's a definite benefit. Even with smaller cache, it's quite likely SOME benefit, but the chances are higher that the cached copy will be thrown out by some part of c[i][j].
In summary, the second algorithm is very likely better for nearly all options.
Algorithm2 benefits from what's called "spatial locality", moving the diagonal into a single dimension array makes it reside in memory in consecutive addresses, and thereby:
Enjoys the benefit of fetching multiple useful elements per a single cache line (presumably 64byte, depending on your CPU), better utilizing cache and memory BW (whereas c[n][n] would also fetch a lot of useless data since it's in the same lines).
Enjoys the benefits of a HW stream prefetchers (assuming such exist in your CPU), that aggressively run ahead of your code along the page and brings the data in advance to the lower cache levels, improving the memory latency.
It should be pointed that moving the data to A doesn't necessarily improve cacheability since A would still compete against a lot of data constantly coming from c and thrashing the cache. However, since it's used over and over, there's a high chance that a good LRU algorithm would make it stay in the cache anyway. You could help that by using streaming memory operations for array c. It should be noted that these are very volatile performance tools, and may on some scenarios lead to perf reduction if not used correctly.
Another potential benefit could come from mixing SW prefetches slightly ahead of reaching every new array line.

Optimizing this code block

for (int i = 0; i < 5000; i++)
for (int j = 0; j < 5000; j++)
{
for (int ii = 0; ii < 20; ii++)
for (int jj = 0; jj < 20; jj++)
{
int num = matBigger[i+ii][j+jj];
// Extract range from this.
int low = num & 0xff;
int high = num >> 8;
if (low < matSmaller[ii][jj] && matSmaller[ii][jj] > high)
// match found
}
}
The machine is x86_64, 32kb L1 cahce, 256 Kb L2 cache.
Any pointers on how can I possibly optimize this code?
EDIT Some background to the original problem : Fastest way to Find a m x n submatrix in M X N matrix
First thing I'd try is to move the ii and jj loops outside the i and j loops. That way you're using the same elements of matSmaller for 25 million iterations of the i and j loops, meaning that you (or the compiler if you're lucky) can hoist the access to them outside those loops:
for (int ii = 0; ii < 20; ii++)
for (int jj = 0; jj < 20; jj++)
int smaller = matSmaller[ii][jj];
for (int i = 0; i < 5000; i++)
for (int j = 0; j < 5000; j++) {
int num = matBigger[i+ii][j+jj];
int low = num & 0xff;
if (low < smaller && smaller > (num >> 8)) {
// match found
}
}
This might be faster (thanks to less access to the matSmaller array), or it might be slower (because I've changed the pattern of access to the matBigger array, and it's possible that I've made it less cache-friendly). A similar alternative would be to move the ii loop outside i and j and hoist matSmaller[ii], but leave the jj loop inside. The rule of thumb is that it's more cache-friendly to increment the last index of a multi-dimensional array in your inner loops, than earlier indexes. So we're "happier" to modify jj and j than we are to modify ii and i.
Second thing I'd try - what's the type of matBigger? Looks like the values in it are only 16 bits, so try it both as int and as (u)int16_t. The former might be faster because aligned int access is fast. The latter might be faster because more of the array fits in cache at any one time.
There are some higher-level things you could consider with some early analysis of smaller: for example if it's 0 then you needn't examine matBigger for that value of ii and jj, because num & 0xff < 0 is always false.
To do better than "guess things and see whether they're faster or not" you need to know for starters which line is hottest, which means you need a profiler.
Some basic advice:
Profile it, so you can learn where the hot-spots are.
Think about cache locality, and the addresses resulting from your loop order.
Use more const in the innermost scope, to hint more to the compiler.
Try breaking it up so you don't compute high if the low test is failing.
Try maintaining the offset into matBigger and matSmaller explicitly, to the innermost stepping into a simple increment.
Best thing ist to understand what the code is supposed to do, then check whether another algorithm exists for this problem.
Apart from that:
if you are just interested if a matching entry exists, make sure to break out of all 3 loops at the position of // match found.
make sure the data is stored in an optimal way. It all depends on your problem, but i.e. it could be more efficient to have just one array of size 5000*5000*20 and overload operator()(int,int,int) for accessing elements.
What are matSmaller and matBigger?
Try changing them to matBigger[i+ii * COL_COUNT + j+jj]
I agree with Steve about rearranging your loops to have the higher count as the inner loop. Since your code is only doing loads and compares, I believe a significant portion of the time is used for pointer arithmetic. Try an experiment to change Steve's answer into this:
for (int ii = 0; ii < 20; ii++)
{
for (int jj = 0; jj < 20; jj++)
{
int smaller = matSmaller[ii][jj];
for (int i = 0; i < 5000; i++)
{
int *pI = &matBigger[i+ii][jj];
for (int j = 0; j < 5000; j++)
{
int num = *pI++;
int low = num & 0xff;
if (low < smaller && smaller > (num >> 8)) {
// match found
} // for j
} // for i
} // for jj
} // for ii
Even in 64-bit mode, the C compiler doesn't necessarily do a great job of keeping everything in register. By changing the array access to be a simple pointer increment, you'll make the compiler's job easier to produce efficient code.
Edit: I just noticed #unwind suggested basically the same thing. Another issue to consider is the statistics of your comparison. Is the low or high comparison more probable? Arrange the conditional statement so that the less probable test is first.
Looks like there is a lot of repetition here. One optimization is to reduce the amount of duplicate effort. Using pen and paper, I'm showing the matBigger "i" index iterating as:
[0 + 0], [0 + 1], [0 + 2], ..., [0 + 19],
[1 + 0], [1 + 1], ..., [1 + 18], [1 + 19]
[2 + 0], ..., [2 + 17], [2 + 18], [2 + 19]
As you can see there are locations that are accessed many times.
Also, multiplying the iteration counts indicate that the inner content is accessed: 20 * 20 * 5000 * 5000, or 10000000000 (10E+9) times. That's a lot!
So rather than trying to speed up the execution of 10E9 instructions (such as execution (pipeline) cache or data cache optimization), try reducing the number of iterations.
The code is searcing the matrix for a number that is within a range: larger than a minimal value and less than the maximum range value.
Based on this, try a different approach:
Find and remember all coordinates where the search value is greater
than the low value. Let us call these anchor points.
For each anchor point, find the coordinates of the first value after
the anchor point that is outside the range.
The objective is to reduce the number of duplicate accesses. Anchor points allow for a one pass scan and allow other decisions such as finding a range or determining an MxN matrix that contains the anchor value.
Another idea is to create new data structures containing the matBigger and matSmaller that are more optimized for searching.
For example, create a {value, coordinate list} entry for each unique value in matSmaller:
Value coordinate list
26 -> (2,3), (6,5), ..., (1007, 75)
31 -> (4,7), (2634, 5), ...
Now you can use this data structure to find values in matSmaller and immediately know their locations. So you could search matBigger for each unique value in this data structure. This again reduces the number of access to the matrices.