Which has better memory access? (C++) [duplicate]

This question already has answers here:
Which ordering of nested loops for iterating over a 2D array is more efficient [duplicate]
(10 answers)
Closed 6 years ago.
Which version is more efficient and why?
It seems that both make the same computations. The only thing I can think of is that the compiler may recognize that in (a) j*M does not change inside the inner loop, so it doesn't have to be recomputed over and over.
Any input would be great!
#define M /* some mildly large number */
double a[M*M], x[M], c[M];
int i, j;
(a) First version
for (j = 0; j < M; j++)
    for (i = 0; i < M; i++)
        c[j] += a[i+j*M]*x[i];
(b) Second version
for (i = 0; i < M; i++)
    for (j = 0; j < M; j++)
        c[j] += a[i+j*M]*x[i];

This is about memory-access patterns rather than computational efficiency. In general (a) is faster because it accesses memory with unit stride, which is much more cache-efficient than (b), which has a stride of M. In the case of (a) each cache line is fully utilised, whereas with (b) it is possible that only one array element will be used from each cache line before it is evicted.
Having said that, some compilers can perform loop reordering optimisations, so in practice you may not see any difference if that happens. As always, you should benchmark/profile your code, rather than just guessing.
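If you want to benchmark this on your own machine, a minimal timing sketch for version (a) might look like the following (the harness and the function name are illustrative, not from the question; only the loop body is taken from it):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Times version (a) from the question on M x M data filled with ones and
// returns c[0] (equal to M), so the computation can't be optimized away.
double run_version_a(int M) {
    std::vector<double> a(static_cast<std::size_t>(M) * M, 1.0), x(M, 1.0), c(M, 0.0);
    auto t0 = std::chrono::steady_clock::now();
    for (int j = 0; j < M; j++)
        for (int i = 0; i < M; i++)
            c[j] += a[i + j * M] * x[i];
    auto t1 = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> ms = t1 - t0;
    std::printf("version (a) with M = %d: %.2f ms\n", M, ms.count());
    return c[0];
}
```

Swapping the two loops gives the strided version (b) for comparison; run both on the same data and compare the reported times.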

Related

How to obtain performance enhancement while multiplying two sub-matrices?

I've got a program multiplying two sub-matrices residing in the same container matrix. I'm trying to obtain some performance gain by using the OpenMP API for parallelization. Below is the multiplication algorithm I use.
#pragma omp parallel for
for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
    for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
        for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
            resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
        }
    }
}
The algorithm accesses the elements of both input sub-matrices row-wise to improve cache usage through spatial locality.
What other OpenMP directives can be used to obtain better performance from that simple algorithm? Is there any other directive for optimizing the operations on the overlapping areas of two sub-matrices?
You can assume that all the sub-matrices have the same size and they are square-shaped. The resulting sub-matrix resides in another container matrix.
For the matrix-matrix product, any permutation of the i,j,k indices computes the right result, sequentially. In parallel, not so: in your original code the k iterations do not write to unique locations, so you cannot simply collapse the outer two loops. Do a k,j interchange and then it is allowed.
Of course OpenMP gets you from 5 percent efficiency on one core to 5 percent on all cores. You really want to block the loops. But that is a lot harder. See the paper by Goto and van de Geijn.
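To illustrate what blocking means here, a sketch of a tiled i,k,j multiply over raw row-major arrays (the function name, the raw-array layout, and the tile size BS are mine, not from the question; BS would need tuning for the target cache):

```cpp
#include <algorithm>
#include <vector>

// Illustrative blocked matrix multiply: C += A * B, all N x N, row-major.
// The three outer loops walk BS x BS tiles so the working set of three tiles
// can stay cache-resident; the inner i,k,j order keeps the accesses to B and
// C contiguous within each tile.
void matmul_blocked(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, int N, int BS) {
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < std::min(ii + BS, N); i++)
                    for (int k = kk; k < std::min(kk + BS, N); k++)
                        for (int j = jj; j < std::min(jj + BS, N); j++)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}
```

An `#pragma omp parallel for` on the `ii` loop would then parallelize over tiles rather than rows, but as the answer notes, getting this right (and fast) is considerably harder than the plain triple loop.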
I'm adding something related to the main matrix. Do you use this code to multiply two bigger matrices? Then one of the sub-matrices is re-used between different iterations and is likely to benefit from the CPU cache. For example, if a matrix consists of 4 sub-matrices, then each sub-matrix is used twice to compute a value of the result matrix.
To benefit most from the cache, the re-used data should be kept in the cache of the same thread (core). To do this, it may be better to move the work-distribution level up to the place where you select the two sub-matrices.
So, something like this:
select sub-matrix A
#pragma omp parallel for
select sub-matrix B
for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
    for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
        for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
            resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
        }
    }
}
could work faster, since the whole data set always stays in the same thread (core).

Is this a good practice for vectorization?

I'm trying to improve performance from this code by vectorizing this function:
inline float calcHaarPattern( const int* origin, const SurfHF* f, int n )
{
    double d = 0;
    for( int k = 0; k < n; k++ )
        d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;
    return (float)d;
}
From my knowledge, you can vectorize loops that involve exactly one math operation. In the code above we have 5 math operations, so (using OMP):
#pragma omp simd
for( int k = 0; k < n; k++ )
    d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;
isn't going to work. However, I was wondering whether breaking the loop above into multiple loops with exactly one math operation each is a good practice for vectorization. The resulting code would be:
double p0[n], p3[n], p1[n], p2[n];
#pragma omp simd
for( int k = 0; k < n; k++ )
    p0[k] = origin[f[k].p0]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
    p3[k] = origin[f[k].p3]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
    p1[k] = origin[f[k].p1]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
    p2[k] = origin[f[k].p2]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
    d += p0[k];
#pragma omp simd
for( int k = 0; k < n; k++ )
    d -= p1[k];
#pragma omp simd
for( int k = 0; k < n; k++ )
    d -= p2[k];
#pragma omp simd
for( int k = 0; k < n; k++ )
    d += p3[k];
Is this a good solution, or is there a better one? Are modern compilers (say, gcc) going to do this kind of optimization (or better) by themselves (e.g. with -O3 enabled), so that there is actually no performance gain to be had?
This is generally bad HPC programming practice for several reasons:
These days you normally have to make your code as computationally dense as possible. To achieve that, you want the highest Arithmetic Intensity (AI) for the loop that you can get. For simplicity, think of AI as the ratio of [amount of computation] divided by [number of bytes moved to/from memory in order to perform that computation].
By splitting the loop, you make the AI of each resulting loop much lower, because you no longer reuse the same bytes for different computations.
A (vector- or thread-) parallel reduction (in your case over the variable "d") has a cost/overhead which you don't want to multiply by 8 or 10 (i.e. by the number of loops you produced by hand).
In many cases the Intel/GCC compiler vectorization engines can optimize slightly better when several data fields of the same object are processed in the same loop body, as opposed to the split-loop case.
There are a few theoretical advantages to loop splitting as well, but they don't apply to your case, so I list them just in case. Loop splitting is reasonable/profitable when:
- there is more than one loop-carried dependency or reduction in the same loop;
- the split helps to get rid of some out-of-order-execution data-flow dependencies;
- you do not have enough vector registers to perform all the computations (for complicated algorithms).
Intel Advisor (mentioned by you in previous questions) helps to analyze many of these factors and measures AI.
It is also true that good compilers "don't care" whether you have one such loop or the split version, because they could easily transform one case into the other on the fly.
However, the applicability of this kind of transformation is very limited in real codes, because in order to do it the compiler has to know a lot of extra information at compile time: whether pointers or dynamic arrays overlap or not, whether the data is aligned or not, etc. So you shouldn't rely on compiler transformations or on a specific compiler minor version, but just write HPC-ready code as far as you are capable.
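For completeness, the usual way to vectorize the original loop without splitting it is a SIMD reduction on d. A sketch, assuming a plausible SurfHF layout (the actual field types are not shown in the question):

```cpp
struct SurfHF { int p0, p1, p2, p3; float w; };  // field layout assumed, not from the question

inline float calcHaarPattern(const int* origin, const SurfHF* f, int n) {
    double d = 0;
    // reduction(+:d) tells the vectorizer that the additions into d may be
    // accumulated in reordered partial sums, which removes the loop-carried
    // dependency on d that would otherwise block vectorization.
    #pragma omp simd reduction(+:d)
    for (int k = 0; k < n; k++)
        d += (origin[f[k].p0] + origin[f[k].p3]
            - origin[f[k].p1] - origin[f[k].p2]) * f[k].w;
    return (float)d;
}
```

The gather accesses origin[f[k].p0] etc. remain irregular, so the speedup depends on the target's gather support, but the AI and reduction-overhead arguments above all favour this single-loop form.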

MATLAB equivalent in C++

In MATLAB, in order to access the odd or even rows and columns of a matrix, we use
A = M(1:2:end,1:2:end);
Is there an equivalent for this in C++, or how do I do this in C++?
Basically, what I want to do is: in MATLAB I have
A(1:2:end,1:2:end) = B(1:2:end,:);
A(2:2:end,2:2:end) = B(2:2:end,:);
I want to implement the same in C++
This is available only in a fairly obscure class, std::valarray. You need a std::gslice (Generalized slice) with stride {2,2} to access the std::valarray.
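A sketch of how that could look for taking every second row and column of a row-major N×N matrix (the helper name every_second is mine, purely for illustration):

```cpp
#include <valarray>

// C++ analogue of MATLAB's M(1:2:end,1:2:end) for a row-major N x N matrix:
// select every second row and every second column.
std::valarray<int> every_second(const std::valarray<int>& M, std::size_t N) {
    // start at flat index 0; (N+1)/2 elements per dimension (handles odd N);
    // strides of 2 rows (2*N flat elements) and 2 columns.
    return M[std::gslice(0, {(N + 1) / 2, (N + 1) / 2}, {2 * N, 2})];
}
```

For a 4×4 matrix holding the values 0..15 row by row, this selects the elements 0, 2, 8 and 10.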
In C++ the for loop is constructed as follows
for (initial state; condition for termination; increment)
So if you are looking for the odd elements, you can:
for (int i = 0; i < size; i += 2),
whereas if you are looking for the even elements:
for (int i = 1; i < size; i += 2).
where size depends on whether you are looping through the rows or the columns. Take into account that, because C++ arrays start at index 0, your odd elements will correspond to even indexes and your even elements to odd indexes.
Now, the answer: If you want to get the elements of a matrix, in C++ you must loop through the matrix with a for loop. You can modify the elements you access by modifying the increment property of the for loop.
for(int i = 0; i < rows/2; i++)
    for(int j = 0; j < columns/2; j++)
        A[i][j] = M[i*2][j*2];

How to make this code faster (learning best practices)? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I have this little loop here, and I was wondering if I'm making some big mistake, performance-wise.
For example, is there a way to rewrite parts of it differently to make vectorization possible (assuming GCC 4.8.1 and all vectorization-friendly flags enabled)?
Is this the best way to pass an array of numbers (const float name_of_var[])?
The idea of the code is to take a vector (in the mathematical sense, not necessarily a std::vector) of unsorted numbers y and two bound values (ox[0] <= ox[1]), and to store in a vector of integers rdx the indices i of the entries of y satisfying ox[0] <= y[i] <= ox[1].
rdx can contain m elements and y has capacity n, with n > m. If there are more than m values of y[i] satisfying ox[0] <= y[i] <= ox[1], then the code should return the first m.
Thanks in advance,
void foo(const int n,const int m,const float y[],const float ox[],int rdx[]){
    int d0,j=0,i=0;
    for(;;){
        i++;
        d0=((y[i]>=ox[0])+(y[i]<=ox[1]))/2;
        if(d0==1){
            rdx[j]=i;
            j++;
        }
        if(j==m) break;
        if(i==n-1) break;
    }
}
d0=((y[i]>=ox[0])+(y[i]<=ox[1]))/2;
if(d0==1)
I believe the use of an intermediary variable is useless and takes a few more cycles.
This is the most optimized version I could think of, but it's totally unreadable...
void foo(int n, int m, float y[],const float ox[],int rdx[])
{
    for(int i = 0; i < n && m != 0; i++)
    {
        if(*y >= *ox && *y <= ox[1])
        {
            *rdx=i;
            rdx++;
            m--;
        }
        y++;
    }
}
I think the following version with a decent optimisation flag should do the job
void foo(int n, int m,const float y[],const float ox[],int rdx[])
{
    for(int j = 0, i = 0; j < m && i < n; i++) // reorder to put the condition with the highest probability to fail first
    {
        if(y[i] >= ox[0] && y[i] <= ox[1])
        {
            rdx[j++] = i;
        }
    }
}
Just to make sure I'm correct: you're trying to find the first m+1 (if it's actually m, do j == m-1) values that are in the range of [ ox[0], ox[1] ]?
If so, wouldn't it be better to do:
for (int i=0, j=0;;++i) {
    if (y[i] < ox[0]) continue;
    if (y[i] > ox[1]) continue;
    rdx[j] = i;
    j++;
    if (j == m || i == n-1) break;
}
If y[i] is indeed in the range you must perform both comparisons as we both do.
If y[i] is under ox[0], no need to perform the second comparison.
I avoid the use of division.
A. Yes, passing the float array as float[] is not only efficient, it is the only way (and is identical to a float * argument).
A1. But in C++ you can use better types without performance loss. Accessing a vector or array (the standard library container) should not be slower than accessing a plain C style array. I would strongly advise you to use those. In modern C++ there is also the possibility to use iterators and functors; I am no expert there but if you can express the independence of operations on different elements by being more abstract you may give the compiler the chance to generate code that is more suitable for vectorization.
B. You should replace the division by a logical AND, operator&&. The first advantage is that the second condition is not evaluated at all if the first one is false -- this could be your most important performance gain here. The second advantage is expressiveness and thus readability.
C. The intermediate variable d0 will probably disappear when you compile with -O3, but it's unnecessary nonetheless.
The rest is OK performance-wise. Idiomatically there is room for improvement, as has been shown already.
D. I am not sure about a chance for vectorization with the code as presented here. The compiler will probably do some loop unrolling at -O3; try to let it emit SSE code (cf. http://gcc.gnu.org/onlinedocs/, specifically http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/i386-and-x86-64-Options.html#i386-and-x86-64-Options). Who knows.
Oh, I just realized that your original code passes the constant interval boundaries as an array with 2 elements, ox[]. Since array access is an unnecessary indirection and as such may carry an overhead, using two normal float parameters would be preferred here. Keep them const like your array. You could also name them nicely.
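A sketch of that variant (the parameter names lower/upper are illustrative, not from the question):

```cpp
// Sketch of the suggested signature: the interval bounds as two named
// scalar const parameters instead of the two-element array ox[].
void foo(int n, int m, const float y[], const float lower, const float upper, int rdx[]) {
    // Store the indices of the first m entries of y that lie in [lower, upper].
    for (int j = 0, i = 0; j < m && i < n; i++)
        if (y[i] >= lower && y[i] <= upper)
            rdx[j++] = i;
}
```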

Compiler nested loop optimization for sequential memory access

I came across a strange performance issue in a matrix multiply benchmark (matrix_mult in Metis from the MOSBENCH suite). The benchmark was optimized to tile the data such that the active working set was 12kb (3 tiles of 32x32 ints) and would fit into the L1 cache. To make a long story short, swapping the two innermost loops made a performance difference of almost 4x on certain array input sizes (4096, 8192) and about 30% on others. The problem essentially came down to accessing elements sequentially versus in a stride pattern. Certain array sizes, I think, created a bad stride access that generated a lot of cache-line collisions. The performance difference is noticeably smaller when changing from a 2-way associative L1 to an 8-way associative L1.
My question is why doesn't gcc optimize loop ordering to maximize sequential memory accesses?
Below is a simplified version of the problem (note that performance times are highly dependent on the L1 configuration; the numbers indicated below are from a 2.3 GHz AMD system with a 64K 2-way associative L1, compiled with -O3).
N = ARRAY_SIZE // 1024
int* mat_A = (int*)malloc(N*N*sizeof(int));
int* mat_B = (int*)malloc(N*N*sizeof(int));
int* mat_C = (int*)malloc(N*N*sizeof(int));
// Elements of mat_B are accessed in a stride pattern of length N
// This takes 800 msec
for (int t = 0; t < 1000; t++)
    for (int a = 0; a < 32; a++)
        for (int b = 0; b < 32; b++)
            for (int c = 0; c < 32; c++)
                mat_C[N*a+b] += mat_A[N*a+c] * mat_B[N*c+b];
// Inner two loops are swapped
// Elements are now accessed sequentially in inner loop
// This takes 172 msec
for (int t = 0; t < 1000; t++)
    for (int a = 0; a < 32; a++)
        for (int c = 0; c < 32; c++)
            for (int b = 0; b < 32; b++)
                mat_C[N*a+b] += mat_A[N*a+c] * mat_B[N*c+b];
gcc might not be able to prove that the pointers don't overlap. If you are fine with using non-standard extensions, you could try using __restrict.
gcc doesn't take full advantage of your architecture to avoid the necessity to recompile for every processor. Using the option -march with the appropriate value for your system might help.
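A sketch of how __restrict could be applied to the inner kernel from the question (the function name is mine; __restrict is a common gcc/clang extension, spelled restrict in C99):

```cpp
// __restrict promises the compiler that the three matrices never alias,
// which lets it keep accumulators in registers and vectorize the inner
// loop without reload-after-store checks.
void tile_mult(int* __restrict mat_C, const int* __restrict mat_A,
               const int* __restrict mat_B, int N) {
    for (int a = 0; a < 32; a++)
        for (int c = 0; c < 32; c++)
            for (int b = 0; b < 32; b++)
                mat_C[N * a + b] += mat_A[N * a + c] * mat_B[N * c + b];
}
```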
gcc has a bunch of optimizations that just do what you want.
Look up the -floop-strip-mine and -floop-block compiler options.
Quote from the manual:
Perform loop blocking transformations on loops. Blocking strip mines each loop in the loop nest such that the memory accesses of the element loops fit inside caches. The strip length can be changed using the loop-block-tile-size parameter.