How do two array lookup codes have different speeds? [duplicate] - c++

This question already has answers here:
Accessing elements of a matrix row-wise versus column-wise
(3 answers)
Closed 4 years ago.
I have an array, long matrix[8*1024][8*1024], and two functions sum1 and sum2:
long sum1(long m[ROWS][COLS]) {
    long register sum = 0;
    int i, j;
    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLS; j++) {
            sum += m[i][j];
        }
    }
    return sum;
}

long sum2(long m[ROWS][COLS]) {
    long register sum = 0;
    int i, j;
    for (j = 0; j < COLS; j++) {
        for (i = 0; i < ROWS; i++) {
            sum += m[i][j];
        }
    }
    return sum;
}
When I execute the two functions with the given array, I get running times:
sum1: 0.19s
sum2: 1.25s
Can anyone explain why there is this huge difference?

C uses row-major ordering to store multidimensional arrays, as documented in § 6.5.2.1 Array subscripting, paragraph 3 of the C Standard:
Successive subscript operators designate an element of a multidimensional array object. If E is an n-dimensional array (n >= 2) with dimensions i x j x . . . x k, then E (used as other than an lvalue) is converted to a pointer to an (n - 1)-dimensional array with dimensions j x . . . x k. If the unary * operator is applied to this pointer explicitly, or implicitly as a result of subscripting, the result is the referenced (n - 1)-dimensional array, which itself is converted into a pointer if used as other than an lvalue. It follows from this that arrays are stored in row-major order (last subscript varies fastest).
Emphasis mine.
Wikipedia has an image that demonstrates this storage technique compared to the other method for storing multidimensional arrays, column-major ordering.
The first function, sum1, accesses data consecutively, matching how the 2D array is actually laid out in memory, so each element it needs is usually already in the cache. sum2 jumps to a different row on every inner iteration, so the element it needs is much less likely to be in the cache.
Some other languages use column-major ordering for multidimensional arrays; among them are R, Fortran and MATLAB. If you wrote equivalent code in those languages, you would observe that sum2 is the faster one.
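As a quick illustration of that layout, here is a minimal sketch (using small ROWS/COLS values rather than the question's 8*1024) checking that element m[i][j] lives at flat offset i*COLS + j from the start of the array:

#include <cassert>

enum { ROWS = 8, COLS = 8 };   // small sizes just for the demonstration
long m[ROWS][COLS];

int main() {
    for (int i = 0; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            // Row-major: the last subscript varies fastest, so the flat offset is i*COLS + j.
            assert(reinterpret_cast<char *>(&m[i][j]) ==
                   reinterpret_cast<char *>(&m[0][0]) + (i * COLS + j) * sizeof(long));
    return 0;
}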

Computers generally use cache to help speed up access to main memory.
The hardware usually used for main memory is relatively slow: it can take many processor cycles for data to come from main memory to the processor. So a computer generally includes a smaller amount of very fast but expensive memory called cache. Computers may have several levels of cache, some of it built into the processor or the processor chip itself and some of it located outside the processor chip.
Since the cache is smaller, it cannot hold everything in main memory. It often cannot even hold everything that one program is using. So the processor has to make decisions about what is kept in cache.
The most frequent accesses a program makes are to consecutive locations in memory. Very often, after a program reads element 237 of an array, it will soon read 238, then 239, and so on. It is much less common for it to read element 7024 just after reading 237.
So the operation of cache is designed to keep portions of main memory that are consecutive in cache. Your sum1 program works well with this because it changes the column index most rapidly, keeping the row index constant while all the columns are processed. The array elements it accesses are laid out consecutively in memory.
Your sum2 program does not work well with this because it changes the row index most rapidly. This skips around in memory, so many of the accesses it makes are not satisfied by cache and have to come from slower main memory.
Related Resource: Memory layout of multi-dimensional arrays

On a machine with data cache (even a 68030 has one), reading/writing data in consecutive memory locations is way faster, because a block of memory (size depends on the processor) is fetched once from memory and then recalled from the cache (read operation) or written all at once (cache flush for write operation).
By "skipping" data (reading far from the previous read), the CPU has to read the memory again.
That's why your first snippet is faster.
For more complex operations (a fast Fourier transform, for instance), where data is read more than once (unlike your example), a lot of libraries (FFTW, for instance) let you specify a stride to accommodate your data organization (in rows/in columns). Never use it; always transpose your data first and use a stride of 1, as it will be faster than trying to work on the data without transposition.
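As a rough sketch of the "transpose first" advice (not FFTW-specific; the function name and the flat, row-major buffers are assumptions for illustration):

// Copy a rows x cols row-major matrix `src` into `dst` as its transpose,
// so that what used to be a column can afterwards be walked with stride 1.
void transpose(const long *src, long *dst, int rows, int cols) {
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            dst[j * rows + i] = src[i * cols + j];
}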
To make sure your data is consecutive, never use 2D notation. First position your data in the selected row and set a pointer to the start of the row, then use an inner loop on that row.
for (i = 0; i < ROWS; i++) {
    const long *row = m[i];
    for (j = 0; j < COLS; j++) {
        sum += row[j];
    }
}
If you cannot do this, that means that your data is wrongly oriented.

This is an issue with the cache.
The cache will automatically read data that lies after the data you requested. So if you read the data row by row, the next data you request will already be in the cache.

A matrix, in memory, is laid out linearly, such that the items in a row are next to each other in memory (spatial locality). When you traverse the items in order, going through all of the columns in a row before moving on to the next one, then whenever the CPU comes across an entry that isn't loaded into its cache yet, it loads that value along with a whole block of other values close to it in physical memory, so the next several values will already be cached by the time it needs to read them.
When you traverse the matrix the other way, the other values loaded in alongside each entry are not going to be the next ones read, so you wind up with a lot more cache misses and the CPU has to sit and wait while the data is brought in from the next layer of the memory hierarchy.
By the time you swing back around to an entry that you had previously cached, it has more than likely been evicted from the cache in favor of all the other data you've since loaded, as it will not have been recently used anymore (temporal locality).

To expand on the other answers saying that this is due to cache misses in the second program: assuming that you are using Linux, *BSD, or macOS, Cachegrind may give you enlightenment. It's part of Valgrind, and it will run your program, without changes, and print the cache-usage statistics. It does run very slowly, though.
http://valgrind.org/docs/manual/cg-manual.html

Related

How to change sub-matrix of a sparse matrix on CUDA device

I have a sparse matrix structure that I am using in conjunction with CUBLAS to implement a linear solver class. I anticipate that the dimensions of the sparse matrices I will be solving will be fairly large (on the order of 10^7 by 10^7).
I will also anticipate that the solver will need to be used many times and that a portion of this matrix will need be updated several times (between computing solutions) as well.
Copying the entire matrix structure from system memory to device memory could become quite a performance bottleneck, since only a fraction of the matrix entries will ever need to be changed at a given time.
What I would like to be able to do is to have a way to update only a particular sub-set / sub-matrix rather than recopy the entire matrix structure from system memory to device memory each time I need to change the matrix.
The matrix data structure would reside on the CUDA device in arrays:
d_col, d_row, and d_val
On the system side I would have corresponding arrays I, J, and val.
So ideally, I would only want to change the subsets of d_val that correspond to the values in the system array, val, that changed.
Note that I do not anticipate that any entries will be added to or removed from the matrix, only that existing entries will change in value.
Naively I would think that to implement this, I would have an integer array or vector on the host side, e.g. updateInds, that would track the indices of entries in val that have changed, but I'm not sure how to efficiently tell the CUDA device to update the corresponding values of d_val.
In essence: how do I change the entries in a CUDA device side array (d_val) at indices updateInds[1], updateInds[2], ..., updateInds[n] to a new set of values val[updateInds[1]], val[updateInds[2]], ..., val[updateInds[n]], without recopying the entire val array from system memory into the CUDA device memory array d_val?
As long as you only want to change the numerical values of the value array associated with CSR (or CSC, or COO) sparse matrix representation, the process is not complicated.
Suppose I have code like this (excerpted from the CUDA conjugate gradient sample):
checkCudaErrors(cudaMalloc((void **)&d_val, nz*sizeof(float)));
...
cudaMemcpy(d_val, val, nz*sizeof(float), cudaMemcpyHostToDevice);
Now, subsequent to this point in the code, let's suppose I need to change some values in the d_val array, corresponding to changes I have made in val:
for (int i = 10; i < 25; i++)
val[i] = 4.0f;
The process to move these particular changes is conceptually the same as if you were updating an array using memcpy, but we will use cudaMemcpy to update the d_val array on the device:
cudaMemcpy(d_val+10, val+10, 15*sizeof(float), cudaMemcpyHostToDevice);
Since these values were all contiguous, I can use a single cudaMemcpy call to effect the transfer.
If I have several disjoint regions similar to above, it will require several calls to cudaMemcpy, one for each region. If, by chance, the regions are equally spaced and of equal length:
for (int i = 10; i < 15; i++)
    val[i] = 1.0f;
for (int i = 20; i < 25; i++)
    val[i] = 2.0f;
for (int i = 30; i < 35; i++)
    val[i] = 4.0f;
then it would also be possible to perform this transfer using a single call to cudaMemcpy2D. The method is outlined here.
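For illustration, here is a minimal sketch of that single strided transfer for the three regions above (5 floats each, starting at index 10 and spaced 10 elements apart); treat the exact numbers as assumptions tied to that example:

// One cudaMemcpy2D call: "height" regions of "width" bytes, each "pitch" bytes apart.
cudaMemcpy2D(d_val + 10,            // dst: first changed element on the device
             10 * sizeof(float),    // dpitch: spacing between region starts on the device
             val + 10,              // src: first changed element on the host
             10 * sizeof(float),    // spitch: same spacing on the host
             5 * sizeof(float),     // width: bytes per region
             3,                     // height: number of regions
             cudaMemcpyHostToDevice);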
Notes:
cudaMemcpy2D is slower than you might expect compared to a cudaMemcpy operation on the same number of elements.
CUDA API calls have some inherent overhead. If a large part of the matrix is to be updated in a scattered fashion, it may still be actually quicker to just transfer the whole d_val array, taking advantage of the fact that this can be done using a single cudaMemcpy operation.
The method described here cannot be used if non-zero values change their location in the sparse matrix. In that case, I cannot provide a general answer for how to surgically update a CSR sparse matrix on the device. And certain relatively simple changes could necessitate updating most of the array data (3 vectors) anyway.

Access a matrix rapidly

I need to access a two-dimensional matrix from C++ code. If the matrix is mat[n][m], I have to access (in a for loop) these positions:
mat[x][y], mat[x-1][y-m-1], mat[x-1][y], mat[x][y-1]
At the next iteration I have to do:
x=x+1
And then, again:
mat[x][y], mat[x-1][y-m-1], mat[x-1][y], mat[x][y-1]
What would be the best way to keep these positions close together in memory to speed up my code?
If you are iterating horizontally, arrange your matrix as mat[y][x], especially if it is an array of arrays (the layout of the matrix isn't clear from your question).
Since you didn't provide sufficient information, it's hard to say which way is better.
You could try to unroll your loop for continuous memory access.
For example, read mat[x][y] for four consecutive values of x, then mat[x-1][y-m-1] for the same four, then mat[x-1][y], then mat[x][y-1]. After that, process the four loaded sets of data in one iteration.
I suspect the bottleneck is not the memory access itself but the calculation of the memory addresses. This access pattern can be written with SIMD loads, so you could save roughly three quarters of the address-calculation cost.
If you have to process your task sequentially, you could try not to use multidimensional subscripts. For example:
for( x=0; x<n; x++ )
doSomething( mat[x][y] );
could be done with:
for( x=y; x<n*m; x+=m )
doSomething( mat[0][x] );
In the second way you avoid one lea instruction.
If I get this right, you loop through your entire array, although you only mention x = x + 1 as an update (nothing for y). I would then see the array as one-dimensional, with a single counter i going from 0 to the total array length. Then the four values to access in each loop would be
mat[i], mat[i-S-m-1], mat[i-S], mat[i-1]
where S is the stride (the row or column length, depending on your representation). This requires fewer address computations, regardless of memory layout. It also needs fewer index checks/updates because there's only one counter i. Plus, S+m+1 is constant, so you can define it as such, as sketched below.
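A minimal sketch of that single-index version, assuming a row-major matrix viewed through a flat pointer (&mat[0][0]) with stride S elements per row; process() and the loop bounds are hypothetical placeholders, not part of the original question:

#include <cstddef>

// Hypothetical placeholder for whatever the original loop body computes.
void process(long a, long b, long c, long d);

// Walk the matrix with a single flat index i = x*S + y; each x = x + 1 step is i += S.
void walk(const long *flat, std::size_t S, std::size_t m,
          std::size_t first, std::size_t last) {
    const std::size_t off = S + m + 1;               // constant, as noted above
    for (std::size_t i = first; i < last; i += S)
        process(flat[i], flat[i - off], flat[i - S], flat[i - 1]);
}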

iterating through a matrix is slower when changing A[i][j] to A[j][i] [duplicate]

This question already has answers here:
c++ 2d array access speed changes based on [a][b] order? [duplicate]
(5 answers)
Closed 10 years ago.
I have a matrix of ints named A, and when I'm iterating through it by columns instead of rows, it runs about 50 ms slower:
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        cout << A[j][i]; // slower than A[i][j]
Does anyone know why this happens? I've asked a couple of people, but none of them knew why. I'm sure it's related to how the addresses are represented in the computer's memory, but still, I'd like to find a more concrete answer.
Iterating through the matrix row by row is faster because of cache memory.
When you access A[i][j], more memory is loaded into the cache than just that one element. Note that each row of your matrix is stored within a contiguous memory block, so while the memory "around" A[i][j] is still in cache, it is more probable that accessing the next element within the same row will result in it being read from cache rather than main memory (see cache miss).
Also see related questions:
Why does the order of the loops affect performance when iterating over a 2D array?
Which of these two for loops is more efficient in terms of time and cache performance
How cache memory works?
Matrix multiplication: Small difference in matrix size, large difference in timings
A 2D array is stored in memory as a 1D array, in either row-major or column-major order. What this means is that an array with 5 columns might, for example, be stored with those 5 columns laid out one after the other (column-major), or with each row laid out one after the other (row-major, as in C++). Depending on how your access pattern lines up with this ordering, your accesses might hit the cache, or every one of them might cause a cache miss, causing the large difference in performance.
It's about the cache-line read mechanism.
Read about spatial locality.
To verify, try to disable cache while running this application. (I forgot how to do this, but it can be done.)
As others have noted, it is a cache issue. Traversing it one way may cause a cache miss each time an array element is accessed.
The cache issue is actually a very important factor for optimizations. It's the reason why it is sometimes better to use a structure of arrays instead of an array of structures. Compare these two:
struct StructOfArrays {
    int values[100];
    char names[100][100];
};
struct StructOfArrays values;

struct NormalValStruct {
    int val;
    char name[100];
};
struct NormalValStruct values[100];
If you iterate over the values member of StructOfArrays, they will probably be loaded into the cache together and read efficiently. When you iterate over the array of NormalValStruct and read only the val member, you will get a cache miss almost every time.
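For concreteness, a minimal sketch of the two traversals being compared, with the variables renamed here (soa, aos) to avoid the name clash between the two declarations above:

struct StructOfArrays soa;           // SoA: all 100 ints packed back to back
struct NormalValStruct aos[100];     // AoS: each int separated by a 100-byte name

long sumSoA = 0, sumAoS = 0;
for (int i = 0; i < 100; ++i)
    sumSoA += soa.values[i];         // sequential ints: few cache misses
for (int i = 0; i < 100; ++i)
    sumAoS += aos[i].val;            // ~104-byte stride between ints: a miss on almost every access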
That trick is often used in high-performance applications, which are, often, games.
Because the first loop accesses memory linearly, while the other accesses it with gaps in between. Thus the first loop is friendlier to the cache.

C++ How to force prefetch data to cache? (array loop)

I have a loop like this:
start = __rdtsc();
unsigned long long count = 0;
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        count += tab[i][j];
stop = __rdtsc();
time = (stop - start) * 1/3;
I need to check how prefetching data influences efficiency. How can I force some values to be prefetched from memory into the cache before they are summed?
For GCC only:
__builtin_prefetch((const void*)(prefetch_address),0,0);
prefetch_address can even be invalid; there will be no segfault. If the difference between prefetch_address and the current read location is too small, there might be no effect or even a slowdown. Try setting it at least 1 KB ahead.
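A minimal sketch of how that could look in the loop from the question (GCC/Clang only; the one-row prefetch distance is an assumption to tune, not part of the original answer):

for (int i = 0; i < N; i++) {
    if (i + 1 < N)
        // Hint the CPU to start fetching the next row while this one is being summed.
        __builtin_prefetch((const void *)&tab[i + 1][0], 0 /* read */, 0 /* low temporal locality */);
    for (int j = 0; j < M; j++)
        count += tab[i][j];
}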
First, I suppose that tab is a large 2D array such as a static array (e.g., int tab[1024*1024][1024*1024]) or a dynamically allocated array (e.g., int** tab followed by mallocs for each row). Here, you want to prefetch some data from tab into the cache to reduce the execution time.
Simply, I don't think that you need to manually insert any prefetching to your code, where a simple reduction for a 2D array is performed. Modern CPUs will do automatic prefetching if necessary and profitable.
Two facts you should know for this problem:
(1) You are already exploiting the spatial locality of tab inside the innermost loop. Once tab[i][0] is read (after a cache miss or a page fault), the data from tab[i][0] to tab[i][15] will be in your CPU caches, assuming that the cache line size is 64 bytes (and the elements are 4-byte ints).
(2) However, when the code crosses from one row to the next, i.e., from tab[i][M-1] to tab[i+1][0], a cold cache miss is highly likely, especially when tab is a dynamically allocated array where each row could be allocated in a fragmented way. However, if the array is statically allocated, each row will be located contiguously in memory.
So, prefetching makes sense only when you prefetch (1) the first item of the next row, or (2) data j + CACHE_LINE_SIZE/sizeof(tab[0][0]) ahead of the current position.
You may do so by inserting a prefetch operation (e.g., __builtin_prefetch) in the upper loop. However, modern compilers may not always emit such prefetch instructions. If you really want to do that, you should check the generated binary code.
However, as I said, I do not recommend doing that, because modern CPUs mostly do prefetching automatically, and that automatic prefetching will mostly outperform your manual code. For instance, on an Intel CPU such as an Ivy Bridge processor, there are multiple data prefetchers that fetch into the L1, L2, or L3 cache. (I don't think mobile processors have such fancy data prefetchers, though.) Some prefetchers will load adjacent cache lines.
If you do more expensive computations on large 2D arrays, there are many alternative algorithms that are friendlier to caches. A notable example is blocked (tiled) matrix multiplication. A naive matrix multiplication suffers a lot of cache misses, but a blocked algorithm significantly reduces them by calculating on small subsets that fit in the caches. See some references like this.
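For reference, a minimal sketch of the blocked (tiled) idea, assuming square n x n row-major matrices in flat arrays, an output C that starts zeroed, and a block size chosen as a tuning assumption:

#include <algorithm>

constexpr int BLOCK = 64;   // assumed so that three BLOCK x BLOCK tiles fit in cache

// C += A * B for row-major n x n matrices stored in flat arrays.
void matmul_blocked(const float *A, const float *B, float *C, int n) {
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                // Work on one tile of A, B and C at a time so they stay cache-resident.
                for (int i = ii; i < std::min(ii + BLOCK, n); ++i)
                    for (int k = kk; k < std::min(kk + BLOCK, n); ++k) {
                        const float a = A[i * n + k];
                        for (int j = jj; j < std::min(jj + BLOCK, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}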
The easiest/most portable method is to simply read some data one cache line's worth of bytes ahead of where you are working. Assuming tab is a proper two-dimensional array, you could:
char *tptr = (char *)&tab[0][0];
tptr += 64;                       // start one cache line ahead of the data being summed
char temp = 0;
volatile char keep_temp_alive;
for (int i = 0; i < N; i++)
{
    temp += *tptr;                // touch memory one cache line ahead to pull it into cache
    tptr += 64;
    for (int j = 0; j < M; j++)
        count += tab[i][j];
}
keep_temp_alive = temp;           // stop the compiler from removing the dummy reads
Something like that. However, it does depend on:
1. You don't end up reading outside the allocated memory [by too much].
2. The j loop does not cover much more than 64 bytes. If it does, you may want to add more steps of temp += *tptr; tptr += 64; at the beginning of the loop.
The keep_temp_alive after the loop is essential to prevent the compiler from completely removing temp as unnecessary loads.
Unfortunately, I was too slow writing generic code to suggest the builtin instructions; the points for that go to Leonid.
The __builtin_prefetch instruction is pretty helpful, but is clang/gcc specific. If you are compiling to multiple compiler targets, I had luck using the x86 intrinsic _mm_prefetch with both clang and MSVC.
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_prefetch
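A minimal sketch of the same idea with the intrinsic, again using the loop from the question and an assumed one-row prefetch distance:

#include <xmmintrin.h>   // _mm_prefetch; works with GCC, Clang and MSVC on x86

for (int i = 0; i < N; i++) {
    if (i + 1 < N)
        // _MM_HINT_T0 asks for the data to be brought into all cache levels.
        _mm_prefetch((const char *)&tab[i + 1][0], _MM_HINT_T0);
    for (int j = 0; j < M; j++)
        count += tab[i][j];
}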

c++ 2d array access speed changes based on [a][b] order? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Why is my program slow when looping over exactly 8192 elements?
I have been tinkering around with a program that I'm using to simply sum the elements of a 2D array. A typo led to what seem, to me at least, like some very strange results.
When dealing with the array matrix[SIZE][SIZE]:
for (int row = 0; row < SIZE; ++row)
    for (int col = 0; col < SIZE; ++col)
        sum1 += matrix[row][col];
This runs very quickly. However, if the sum1 line above is modified to:
sum2 += matrix[col][row]
as I did once by accident without realizing it, my runtime increases SIGNIFICANTLY. Why is this?
This is due to caching behaviour of your program.
Arrays are just consecutive blocks of memory, so when you access [row][column] in the inner loop you are accessing the memory sequentially. This means the data you need is usually already in the cache (and on the same page), so the access is much faster.
When you do [column][row], you aren't accessing that memory sequentially anymore, so you will end up with more cache misses, so your program runs much slower.
The memory locations of matrix[row][col] and matrix[row][col + 1] are adjacent.
The memory locations of matrix[row][col] and matrix[row + 1][col] are separated by SIZE items.
Computers like accessing memory SEQUENTIALLY, not RANDOMLY, thus the adjacent access is faster. For an analogy, think of hard-drive performance: sequential read/write is always better than random read/write. This has to do with how your CPU caches memory and tries to predict what you'll need next.
It's because in the quicker case the CPU's memory prefetching is actually useful as you're iterating in a linear fashion. In the slow case you're jumping around the memory and so prefetching has little effect as the data is unlikely to be in the cache.
It depends on how the matrix is ordered. You are accessing the array in either row-major or column-major order. Depending on how it is stored in memory, the speed will differ between the two access orders.
If the 2D array is built as a pointer to pointers (a dynamically allocated array of rows, rather than the single contiguous block that the built-in matrix[SIZE][SIZE] in the question occupies), it looks like this:
[*p] [*p] [*p]
  |    |    |
  v    v    v
 [d]  [d]  [d]
 [a]  [a]  [a]
 [t]  [t]  [t]
 [a]  [a]  [a]
So when you access the data these row pointers point to, the CPU pulls the surrounding block of memory into its cache.
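For contrast, a minimal sketch of the pointer-to-pointer layout described in this last answer (as opposed to the contiguous built-in array in the question); the size is a placeholder:

const int SIZE = 1024;                 // placeholder size
int **A = new int*[SIZE];              // the "main" array of row pointers
for (int i = 0; i < SIZE; ++i)
    A[i] = new int[SIZE];              // each row is its own allocation, not necessarily adjacent to the others

// ... use A[i][j] ...

for (int i = 0; i < SIZE; ++i)
    delete[] A[i];
delete[] A;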