Access a matrix rapidly - C++

I need to access a two-dimensional matrix in C++ code. If the matrix is mat[n][m], I have to access (in a for-loop) these positions:
mat[x][y], mat[x-1][y-m-1], mat[x-1][y], mat[x][y-1]
At the next iteration I have to do:
x=x+1
And then, again:
mat[x][y], mat[x-1][y-m-1], mat[x-1][y], mat[x][y-1]
What would be the best way to keep these positions close together in memory to speed up my code?

If you are iterating horizontally, arrange your matrix as mat[y][x], especially if it is an array of arrays (the layout of the matrix isn't clear in your question).

Since you didn't provide sufficient information, it's hard to say which way is better.
You could try to unroll your loop to get more contiguous memory access.
For example, read mat[x][y] for 4 consecutive values of x, then mat[x-1][y-m-1] for those same 4 values, then mat[x-1][y], then mat[x][y-1]. After that, process the 4 loaded sets of data in one iteration (see the sketch below).
I suspect the bottleneck is not the memory access itself but the calculation of memory addresses. Grouped loads like this can also be expressed as SIMD loads, so you could cut the address-calculation cost to roughly a quarter.
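A minimal sketch of that gathering idea, assuming a plain row-major mat[n][m] of double, that x is the only index advancing, and that x0 and process() are placeholders for the real starting index and the real work:

// Hypothetical sketch: gather the four neighbour values for 4 consecutive
// iterations first, then process the 4 gathered sets together.
for (int x = x0; x + 3 < n; x += 4) {
    double a[4], b[4], c[4], d[4];
    for (int k = 0; k < 4; ++k) a[k] = mat[x + k][y];
    for (int k = 0; k < 4; ++k) b[k] = mat[x + k - 1][y - m - 1];
    for (int k = 0; k < 4; ++k) c[k] = mat[x + k - 1][y];
    for (int k = 0; k < 4; ++k) d[k] = mat[x + k][y - 1];
    for (int k = 0; k < 4; ++k)
        process(a[k], b[k], c[k], d[k]);  // one "iteration" of the original loop
}

Note that the four loads within each small loop are still a full row apart, so the gain comes from fewer address computations and from grouping loads in a SIMD-friendly way, not from making the accesses contiguous.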
If you have to process your task sequentially, you could try to avoid multidimensional subscripts. For example:
for( x=0; x<n; x++ )
    doSomething( mat[x][y] );
could be done with:
for( x=y; x<n*m; x+=m )
    doSomething( mat[0][x] );
In the second form you avoid one lea instruction.

If I get this right, you loop through your entire array, although you only mention x = x + 1 as an update (nothing for y). I would then see the array as one-dimensional, with a single counter i going from 0 to the total array length. Then the four values to access in each loop would be
mat[i], mat[i-S-m-1], mat[i-S], mat[i-1]
where S is the stride (the row or column length, depending on your representation). This requires fewer address computations, regardless of memory layout. It also needs fewer index checks/updates because there is only one counter i. Plus, S+m+1 is constant, so you could define it as such.
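A minimal sketch of this, assuming row-major storage with stride S = m, an element type of double, and process() standing in for the real work:

const int S = m;              // stride between consecutive rows
const int K = S + m + 1;      // constant combined offset for mat[x-1][y-m-1]
double *flat = &mat[0][0];    // view the 2D matrix as one linear array
for (int i = K; i < n * m; ++i)
    process(flat[i], flat[i - K], flat[i - S], flat[i - 1]);

Starting at i = K keeps every offset in range without per-iteration bounds checks on the individual neighbours.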

Related

Why do two array lookup loops have different speeds? [duplicate]

This question already has answers here:
Accessing elements of a matrix row-wise versus column-wise
(3 answers)
Closed 4 years ago.
I have an array, long matrix[8*1024][8*1024], and two functions sum1 and sum2:
long sum1(long m[ROWS][COLS]) {
    long register sum = 0;
    int i, j;
    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLS; j++) {
            sum += m[i][j];
        }
    }
    return sum;
}

long sum2(long m[ROWS][COLS]) {
    long register sum = 0;
    int i, j;
    for (j = 0; j < COLS; j++) {
        for (i = 0; i < ROWS; i++) {
            sum += m[i][j];
        }
    }
    return sum;
}
When I execute the two functions with the given array, I get running times:
sum1: 0.19s
sum2: 1.25s
Can anyone explain why there is this huge difference?
C uses row-major ordering to store multidimensional arrays, as documented in § 6.5.2.1 Array subscripting, paragraph 3 of the C Standard:
Successive subscript operators designate an element of a multidimensional array object. If E is an n-dimensional array (n >= 2) with dimensions i x j x . . . x k, then E (used as other than an lvalue) is converted to a pointer to an (n - 1)-dimensional array with dimensions j x . . . x k. If the unary * operator is applied to this pointer explicitly, or implicitly as a result of subscripting, the result is the referenced (n - 1)-dimensional array, which itself is converted into a pointer if used as other than an lvalue. It follows from this that arrays are stored in row-major order (last subscript varies fastest).
Emphasis mine.
Wikipedia has an image that illustrates this storage technique compared with the other method for storing multidimensional arrays, column-major ordering.
The first function, sum1, accesses data in the order in which the 2D array is actually laid out in memory, so the data it needs is usually already in the cache. sum2 jumps to a different row on every iteration, so the data it needs is much less likely to be in the cache.
Some other languages use column-major ordering for multidimensional arrays; among them are R, FORTRAN and MATLAB. If you wrote equivalent code in those languages, you would observe sum2 running faster.
Computers generally use cache to help speed up access to main memory.
The hardware usually used for main memory is relatively slow; it can take many processor cycles for data to come from main memory to the processor. So a computer generally includes a smaller amount of very fast but expensive memory called cache. Computers may have several levels of cache; some is built into the processor or the processor chip itself, and some is located outside the processor chip.
Since the cache is smaller, it cannot hold everything in main memory. It often cannot even hold everything that one program is using. So the processor has to make decisions about what is kept in cache.
The most frequent accesses of a program are to consecutive locations in memory. Very often, after a program reads element 237 of an array, it will soon read 238, then 239, and so on. It is less often that it reads 7024 just after reading 237.
So the operation of cache is designed to keep portions of main memory that are consecutive in cache. Your sum1 program works well with this because it changes the column index most rapidly, keeping the row index constant while all the columns are processed. The array elements it accesses are laid out consecutively in memory.
Your sum2 program does not work well with this because it changes the row index most rapidly. This skips around in memory, so many of the accesses it makes are not satisfied by cache and have to come from slower main memory.
Related Resource: Memory layout of multi-dimensional arrays
On a machine with data cache (even a 68030 has one), reading/writing data in consecutive memory locations is way faster, because a block of memory (size depends on the processor) is fetched once from memory and then recalled from the cache (read operation) or written all at once (cache flush for write operation).
By "skipping" data (reading far from the previous read), the CPU has to read the memory again.
That's why your first snippet is faster.
For more complex operations (a fast Fourier transform, for instance), where data is read more than once (unlike your example), many libraries (FFTW, for instance) let you specify a stride to accommodate your data organization (in rows / in columns). Never use it; always transpose your data first and use a stride of 1, as that will be faster than trying to work on the data without transposing it.
To make sure your accesses are consecutive, avoid 2D notation in the hot loop: set a pointer to the start of the selected row, then iterate over that row with an inner loop.
for (i = 0; i < ROWS; i++) {
    const long *row = m[i];
    for (j = 0; j < COLS; j++) {
        sum += row[j];
    }
}
If you cannot do this, that means that your data is wrongly oriented.
This is an issue with the cache.
The cache will automatically read data that lies after the data you requested. So if you read the data row by row, the next data you request will already be in the cache.
A matrix is laid out linearly in memory, such that the items in a row are next to each other (spatial locality). When you traverse the items in order, going through all of the columns in a row before moving on to the next row, then whenever the CPU comes across an entry that isn't in its cache yet, it loads that value along with a whole block of neighbouring values, so the next several values are already cached by the time it needs to read them.
When you traverse them the other way, the neighbouring values that get loaded are not the ones read next, so you wind up with many more cache misses, and the CPU has to sit and wait while the data is brought in from the next layer of the memory hierarchy.
By the time you swing back around to an entry that was previously cached, it has more than likely been evicted in favor of all the other data you've loaded since, because it hasn't been used recently (temporal locality).
To expand on the other answers that this is due to cache-misses for the second program, and assuming that you are using Linux, *BSD, or MacOS, then Cachegrind may give you enlightenment. It's part of valgrind, and will run your program, without changes, and print the cache usage statistics. It does run very slowly though.
http://valgrind.org/docs/manual/cg-manual.html

C++ - Performance of static arrays, with variable size at launch

I wrote a cellular automaton program that stores data in a matrix (an array of arrays). For a 300*200 matrix I can achieve 60 or more iterations per second using static memory allocation (e.g. std::array).
I would like to produce matrices of different sizes without recompiling the program every time, i.e. the user enters a size and then the simulation for that matrix size begins. However, if I use dynamic memory allocation (e.g. std::vector), the simulation drops to ~2 iterations per second. How can I solve this problem? One option I've resorted to is to pre-allocate a static array larger than what I anticipate the user will select (e.g. 2000*2000), but this seems wasteful and still limits user choice to some degree.
I'm wondering if I can either
a) allocate memory once and then somehow "freeze" it for ordinary static array performance?
b) or perform more efficient operations on the std::vector? For reference, I am only performing matrix[x][y] == 1 and matrix[x][y] = 1 operations on the matrix.
According to this question/answer, there is no difference in performance between std::vector or using pointers.
EDIT:
I've rewritten the matrix, as per UmNyobe's suggestion, to be a single array, accessed via matrix[y*size_x + x]. Using dynamic memory allocation (sized once at launch), performance more than doubles, to about 5 iterations per second.
As per PaulMcKenzie's comment, I compiled a release build and got the performance I was looking for (60 or more iterations per second). However, this is the foundation for more work, so I still wanted to quantify the benefit of one method over the other. I used a std::chrono::high_resolution_clock to time each iteration and found that, after switching to the single-array representation, the difference between dynamic and static arrays is within the margin of error (450-600 microseconds per iteration); a minimal timing sketch is included below.
The performance during debugging is a slight concern, however, so I think I'll keep both and switch to a static array when debugging.
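For reference, a minimal sketch of that per-iteration timing, where step() is a hypothetical stand-in for one simulation iteration:

#include <chrono>
#include <cstdio>

void step();  // hypothetical: advances the cellular automaton by one iteration

void timedStep() {
    auto t0 = std::chrono::high_resolution_clock::now();
    step();
    auto t1 = std::chrono::high_resolution_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("iteration took %lld microseconds\n", static_cast<long long>(us));
}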
For reference, I am only performing matrix[x][y]
Red flag! Are you using vector<vector<int>> for your matrix representation? That is a mistake, as the rows of your matrix will be far apart in memory. You should use a single vector of size rows * cols and index it as matrix[y * cols + x].
Furthermore, you should index first by rows and then by columns, i.e. matrix[y][x] rather than matrix[x][y], and your algorithm should traverse the data in the same order. The reason is that with matrix[y][x], elements (x, y) and (x + 1, y) sit next to each other in memory, while with any other arrangement the elements (x, y), (x + 1, y) and (x, y + 1) are much farther apart.
Even if there is a performance decrease going from std::array to std::vector (the array can keep its elements on the stack, which can be faster), a decent algorithm will perform within the same order of magnitude with either container.
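A minimal sketch of the single-vector layout recommended above (the class and member names are illustrative, not from the original code):

#include <vector>
#include <cstddef>

struct Grid {
    std::size_t size_x, size_y;
    std::vector<int> cells;  // one contiguous block, row-major

    Grid(std::size_t sx, std::size_t sy)
        : size_x(sx), size_y(sy), cells(sx * sy, 0) {}

    // Cells that differ only in x are adjacent in memory.
    int& at(std::size_t x, std::size_t y)             { return cells[y * size_x + x]; }
    const int& at(std::size_t x, std::size_t y) const { return cells[y * size_x + x]; }
};

The original matrix[x][y] == 1 and matrix[x][y] = 1 operations then become grid.at(x, y) == 1 and grid.at(x, y) = 1, with inner loops varying x fastest.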

CUDA - Understanding parallel execution of threads (warps) and coalesced memory access

I just started to code in CUDA and I'm trying to get my head around the concepts of how threads are executed and memory accessed in order to get the most out of the GPU. I read through the CUDA best practice guide, the book CUDA by Example and several posts here. I also found the reduction example by Mark Harris quite interesting and useful, but despite all the information I got rather confused on the details.
Let's assume we have a large 2D array (N*M) on which we do column-wise operations. I split the array into blocks so that each block has a number of threads that is a multiple of 32 (all threads fit into several warps). The first thread in each block allocates additional memory (a copy of the initial array, but only for the size of its own dimension) and shares the pointer through a __shared__ variable so that all threads of the same block can access the same memory. Since the number of threads is a multiple of 32, the memory should also be a multiple of 32 so it can be accessed in a single read. However, I need extra padding around the memory block, a border, so that the width of my array becomes (32*x)+2 columns. The border comes from decomposing the large array, so that I have overlapping areas in which a copy of the neighbours is temporarily available.
Coalesced memory access:
Imagine the threads of a block are accessing the local memory block
1 int x = threadIdx.x;
2
3 for (int y = 0; y < height; y++)
4 {
5 double value_centre = array[y*width + x+1]; // remember we have the border so we need an offset of +1
6 double value_left = array[y*width + x ]; // hence the left element is at x
7 double value_right = array[y*width + x+2]; // and the right element at x+2
8
9 // .. do something
10 }
Now, my understanding is that since I do have an offset (+1,+2), which is unavoidable, I will have at least two reads per warp and per assignment (except for the left elements), or does it not matter from where I start reading as long as the memory after the 1st thread is perfectly aligned? Note also, if that is not the case then I would have unaligned access to the array for each row after the first one, since the width of my array is (32*x)+2, and hence not 32-byte aligned. A further padding would however solve the problem for each new row.
Question: Is my understanding correct that in the example above only the first row would allow coalesced access, and only for the left element in the array, since that is the only one accessed without any offset?
Threads executed in a warp:
Threads in a warp are executed in parallel only if all the instructions are the same (according to the link). If I have a conditional statement / diverging execution, then that particular thread will be executed by itself and not together with the others in the warp.
For example if I initialise the array I could do something like
1 int x = threadIdx.x;
2
3 array[x+1] = globalArray[blockIdx.x * blockDim.x + x]; // remember the border and therefore use +1
4
5 if (x == 0 || x == blockDim.x-1) // border
6 {
7 array[x] = DBL_MAX;
8 }
Will the warp be of size 32 and execute in parallel up to line 3, then stall the other threads while only the first and last threads continue to initialise the border, or will those two threads be separated from all the others right from the start, since there is an if statement the other threads do not satisfy?
Question: How are threads collected into a single warp? Each thread in a warp needs to share the same instructions. Does this need to hold for the whole function? This is not the case for thread 1 (x = 0), since it also initialises the border and is therefore different from the others. To my understanding, thread 1 is executed in a single warp, threads (2-33, etc.) in another warp, which then doesn't access the memory in a single read, due to misalignment, and then again the final thread in a single warp due to the other border. Is that correct?
I wonder what the best practice is: either keep memory perfectly aligned for each row (in which case I would run each block with 32*x-2 threads, so that the array with border, (32*x-2)+2, is a multiple of 32 for each row), or do it the way I demonstrated above, with the number of threads a multiple of 32 per block, and just live with the unaligned memory. I am aware that these sorts of questions are not always straightforward and often depend on the particular case, but sometimes certain things are bad practice and should not become a habit.
When I experimented a little bit, I didn't really notice a difference in execution time, but maybe my examples were just too simple. I tried to get information from the visual profiler, but I haven't really understood all the information it gives me. However, I got a warning that my occupancy level is at 17%, which I think must be really low, so there is probably something I am doing wrong. I didn't manage to find information on how the threads are executed in parallel or how efficient my memory access is.
-Edit-
Added and highlighted 2 questions, one about memory access, the other one about how threads are collected to a single warp.
Now, my understanding is that since I do have an offset (+1,+2), which is unavoidable, I will have at least two reads per warp and per assignment (except for the left elements), or does it not matter from where I start reading as long as the memory after the 1st thread is perfectly aligned?
Yes, it does matter "from where you start reading" if you are trying to achieve perfect coalescing. Perfect coalescing means the read activity for a given warp and a given instruction all comes from the same 128-byte aligned cacheline.
Question: Is my understanding correct that in the example above only the first row would allow coalesced access, and only for the left element in the array, since that is the only one accessed without any offset?
Yes. For cc2.0 and higher devices, the cache(s) may mitigate some of the drawbacks of unaligned access.
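If you do want every row of the tile to start on a warp-sized boundary, the usual approach is the extra padding mentioned in the question: round the row width (payload plus the 2 border columns) up to a multiple of 32 and index with that pitch. A minimal host-side sketch of the arithmetic, with assumed sizes:

#include <cstdio>

// Round rowWidth up to the next multiple of warpSize.
int paddedPitch(int rowWidth, int warpSize = 32) {
    return ((rowWidth + warpSize - 1) / warpSize) * warpSize;
}

int main() {
    int width = 32 * 4 + 2;          // hypothetical: 32*x payload columns + 2 border columns
    int pitch = paddedPitch(width);  // 160; element (x, y) then lives at y * pitch + x
    std::printf("width = %d, pitch = %d\n", width, pitch);
}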
Question: How are threads collected into a single warp? Each thread in a warp needs to share the same instructions. Does this need to hold for the whole function? This is not the case for thread 1 (x = 0), since it also initialises the border and is therefore different from the others. To my understanding, thread 1 is executed in a single warp, threads (2-33, etc.) in another warp, which then doesn't access the memory in a single read, due to misalignment, and then again the final thread in a single warp due to the other border. Is that correct?
The grouping of threads into warps always follows the same rules, and will not vary based on the specifics of the code you write, but is only affected by your launch configuration. When you write code that not all the threads will participate in (such as in your if statement), then the warp still proceeds in lockstep, but the threads that do not participate are idle. When you are filling in borders like this, it's rarely possible to get perfectly aligned or coalesced reads, so don't worry about it. The machine gives you that flexibility.

Iterating through a matrix is slower when changing A[i][j] to A[j][i] [duplicate]

This question already has answers here:
c++ 2d array access speed changes based on [a][b] order? [duplicate]
(5 answers)
Closed 10 years ago.
I have a matrix of ints named A, and when I'm iterating through it by columns instead of rows, it runs about 50 ms slower:
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        cout << A[j][i]; // slower than A[i][j]
Does anyone know why this happens? I've asked a couple of people, but none of them knew why. I'm sure it's related to how addresses are laid out in the computer's memory, but I'd still like a more concrete answer.
Iterating through the matrix row by row is faster because of cache memory.
When you access A[i][j], more memory is loaded into the cache than just that one element. Note that each row of your matrix is stored in a contiguous memory block, so while the memory "around" A[i][j] is still in the cache, accessing the next element within the same row will most likely be served from the cache rather than main memory (see cache miss).
Also see related questions:
Why does the order of the loops affect performance when iterating over a 2D array?
Which of these two for loops is more efficient in terms of time and cache performance
How cache memory works?
Matrix multiplication: Small difference in matrix size, large difference in timings
A 2D array is stored in memory as a 1D array, in either row-major or column-major order. In C++ it is row-major: an array with 5 columns is stored row after row, with each row's 5 elements contiguous. Depending on how well your access pattern matches this ordering, your accesses may hit the cache, or nearly every one of them may miss, which causes the large difference in performance.
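A tiny illustration of that layout for a hypothetical 3x5 int array: the element one past the end of a row is the first element of the next row.

#include <cassert>

int main() {
    int A[3][5] = {};
    assert(&A[0][0] + 5 == &A[1][0]);          // rows are stored back to back
    assert(&A[1][2] == &A[0][0] + 1 * 5 + 2);  // A[i][j] sits at linear offset i*5 + j
}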
It's about the cache line read mechanism.
Read about spatial locality.
To verify, try to disable cache while running this application. (I forgot how to do this, but it can be done.)
As others have noted, it is a cache issue. Iterating one way may cause a cache miss on nearly every array element access.
The cache issue is actually a very important factor for optimization. It's the reason why it is sometimes better to use a structure of arrays instead of an array of structures. Compare these two:
struct StructOfArrays {
    int values[100];
    char names[100][100];
};
struct StructOfArrays values;

struct NormalValStruct {
    int val;
    char name[100];
};
struct NormalValStruct values[100];
If you iterate over values in StructOfArrays, they will likely be loaded into the cache in whole chunks and read efficiently. When you iterate over the NormalValStruct array and read the val member, you will incur cache misses far more often, because each int is separated from the next by a 100-byte name buffer.
That trick is often used in high-performance applications, which are, often, games.
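A small sketch contrasting the two layouts above (the sum functions are illustrative, not part of the original answer):

long sumSoA(const struct StructOfArrays *s) {
    long total = 0;
    for (int i = 0; i < 100; ++i)
        total += s->values[i];   // 100 ints packed back to back: cache friendly
    return total;
}

long sumAoS(const struct NormalValStruct *arr) {
    long total = 0;
    for (int i = 0; i < 100; ++i)
        total += arr[i].val;     // each int is separated by a 100-byte name buffer
    return total;
}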
Because the first loop accesses memory linearly, while the other accesses it with gaps in between. Thus the first loop is friendlier to the cache.

Describing a very long matrix by a vector of vectors, which dimension should be the largest?

I'm writing code that uses a large matrix where the elements are a user defined class. To build this matrix, I use the following vector of vectors.
using namespace std;
vector< vector< userclass > > matrix = vector<vector<userclass> >(sizeX, vector<userclass>(sizeY));
This class, which might also be a struct, will contain a few builtins such as floats and pointers. So here's the thing:
Let's say the matrix will have a size of 2000 in one direction, but only a size of 20 in the other, but I have total freedom to choose which one. For the best performance, which one should I make the biggest, sizeX or sizeY?
In other words: which is faster, a small vector of large vectors, or a large vector of small vectors? Is there a difference at all?
Performance should be optimized for single random accesses.
You should aim for the smallest number of vectors possible, which means sizeY (the inner dimension) should be larger than sizeX, both for better cache performance and because it takes up less space.
Of course, it depends on how you intend to use them. If you can, try to stay within one inner vector for as long as possible: vec[i][j] with j varying fastest is much better than vec[j][i]. If you have to do vec[j][i], then making sizeX larger may perform better, or use one contiguous array.
Fastest iterating where sizeX > sizeY:
for (int i ...)
    for (int j ...) {
        vec[i][j];
    }
There are different things to consider here. First of all, you are probably better off defining your own matrix type that holds a single vector of data of size sizeX*sizeY, together with operators that map coordinates to the location of the element in the vector.
The advantage of this approach is that the memory footprint is more compact (less memory used¹) and the memory is contiguous.
As for how that mapping should be done, considering mainly performance, it depends on how the data is used. If you are going to iterate in a particular direction, you want consecutive elements in that direction to occupy contiguous positions in memory (i.e. if you are going to iterate with an outer loop over Y and an inner loop over X, then the formula should be pos = y * sizeX + x).
¹ Assuming that the type takes 10 bytes, a vector of 2000 vectors of 20 elements takes (2000+1)*sizeof(vector) + 2000*20*10 bytes, a vector of 20 vectors of 2000 elements takes approximately (20+1)*sizeof(vector) + 2000*20*10 bytes, and a single vector of 2000*20 elements takes sizeof(vector) + 2000*20*10 bytes. Roughly, on a 64-bit platform in release mode with no extra debugging information, sizeof(vector<X>) ~ 3*8 (i.e. 24 bytes), and the totals would be 448024, 400504 and 400024 bytes. That might not make much of a difference, but in the first case there is roughly 12% extra memory in use compared to the optimal case.
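A quick sketch reproducing that arithmetic (it simply reuses the footnote's assumed 24-byte vector header and 10-byte element; real sizes depend on the implementation and on struct padding):

#include <cstdio>

int main() {
    const long vec = 24, elem = 10, X = 2000, Y = 20;
    std::printf("vector of 2000 vectors of 20: %ld bytes\n", (X + 1) * vec + X * Y * elem); // 448024
    std::printf("vector of 20 vectors of 2000: %ld bytes\n", (Y + 1) * vec + X * Y * elem); // 400504
    std::printf("single flat vector:           %ld bytes\n", vec + X * Y * elem);           // 400024
}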