Cache-friendly matrix for access to adjacent cells - C++

Current Design
In my program I have a big 2-D grid (1000 x 1000, or more), and each cell contains a small piece of information.
In order to represent this concept the choice is quite trivial: a matrix data structure.
The corresponding code (in C++) is something like:
int w_size_grid = 1000;
int h_size_grid = 1000;
int* matrix = new int[w_size_grid * h_size_grid];
As you can notice, I've actually used a plain one-dimensional array rather than a real matrix type, but the principle is the same.
In order to access an element of the grid, we need a function that, given a cell identified by (x, y), returns the value stored in that cell.
Mathematically:
f(x, y) -> value stored in cell (x, y)
or, more formally:
f: Z^2 -> Z, where Z is the set of integers.
That can be trivially achieved with a linear function. Here is a C++ representation:
int get_value(int x, int y) {
    return matrix[y * w_size_grid + x];
}
Additional Implementation Notes
Actually the design requires a sort of "circular", continuous grid: the access indices for a cell may go outside the limits of the grid itself.
That means, for example, that get_value(-1, -1) is still valid: the function will just return the same value as get_value(w_size_grid - 1, h_size_grid - 1).
This is not a problem in the implementation:
int get_value(int x, int y) {
    adjust_xy(&x, &y); // modify x and y in accordance with the wrap-around rule
    return matrix[y * w_size_grid + x];
}
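For completeness, a minimal sketch of what adjust_xy could look like under that wrap-around rule (this implementation is an assumption, not code from the original program; the double modulo handles negative indices):

void adjust_xy(int* x, int* y) {
    // Map any integer coordinate onto [0, size), torus-style.
    *x = ((*x % w_size_grid) + w_size_grid) % w_size_grid;
    *y = ((*y % h_size_grid) + h_size_grid) % h_size_grid;
}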
Anyway, this is just an additional note to make the scenario clearer.
What is the problem?
The problem presented above is trivial to design and implement.
My problem is that the matrix is updated at a high frequency: each cell in the matrix is read and possibly updated with a new value.
Naturally, the matrix is traversed with two nested loops, in accordance with a cache-friendly design:
for (int y = 0; y < h_size_grid; ++y) {
    for (int x = 0; x < w_size_grid; ++x) {
        int value = get_value(x, y);
    }
}
The inner loop iterates over x, since [x-1], [x], [x+1] are stored contiguously; that loop therefore exploits the principle of locality.
The problem arises because updating the value of a cell depends on the values of the adjacent cells.
Each cell has exactly eight neighbours, which are the cells that are horizontally, vertically, or diagonally adjacent.
(-1,-1) | (0,-1) | (1,-1)
-------------------------
(-1, 0) | (0, 0) | (1, 0)
-------------------------
(-1, 1) | (0, 1) | (1, 1)
So the code is intuitively:
for (int y = 0; y < h_size_grid; ++y) {
    for (int x = 0; x < w_size_grid; ++x) {
        int value = get_value(x, y);
        auto values = get_value_all_neighbours(x, y); // the 8 neighbour values
    }
}
The function get_value_all_neighbours(x, y) accesses the row above and the row below y in the matrix.
Since a row of the matrix is quite large, touching those rows causes cache misses and pollutes the cache itself.
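For concreteness, here is one possible shape of that function (a sketch only: the question does not show its signature, and the std::array return type is an assumption):

#include <array>

std::array<int, 8> get_value_all_neighbours(int x, int y) {
    std::array<int, 8> values;
    int i = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;        // skip the cell itself
            values[i++] = get_value(x + dx, y + dy); // wraps via adjust_xy
        }
    }
    return values;
}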
The Question
Now that I have finally presented the scenario and the problem, my question is how to "solve" it.
Using some additional data structure, or reorganizing the data, is there a way to exploit the cache and avoid all those misses?
Some Personal Consideration
My feelings guide me toward a strategic data structure.
I've thought about changing the order in which the values are stored in the array, trying to store neighbouring cells at contiguous indices.
That implies a function for get_value that is no longer linear.
After some thinking, I believe it is not possible to find such a non-linear function.
I've also thought about additional data structures, like a hash table storing the adjacent values of each cell, but I think that is overkill in space and maybe in CPU cycles as well.

Let's assume you do indeed have a problem with cache misses that can't easily be avoided (referring to other answers here).
You could use a space-filling curve to organize your data in a cache-friendly way. Essentially, space-filling curves map a volume or plane (such as your matrix) to a linear representation, such that values that are close together in space (mostly) end up close together in the linear representation. In effect, if you store the matrix in a z-ordered array, neighbouring elements have a high likelihood of being on the same memory page.
The best proximity mapping is given by the Hilbert curve, but it is expensive to calculate. A better option may be the z-curve (Morton order): it provides good proximity and is fast to calculate.
Z-Curve: Essentially, to get the ordering, you have to interleave the bits of your x and y coordinate into a single value, called 'z-value'. This z-value determines the position in your list of matrix values (or even simply the index in an array if you use an array). The z-values are consecutive for a completely filled matrix (every cell is used). Conversely, you can de-interleave the position in the list (=array index) and get your x/y coordinates back.
Interleaving the bits of two values is quite fast; there are even dedicated CPU instructions that do this in a few cycles. If you can't find these (I can't, at the moment), you can simply use some bit-twiddling tricks.
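As an illustration, a minimal sketch of a z-value encoder using the classic bit-twiddling approach (the magic constants are the standard "interleave bits by binary magic numbers" sequence; 16-bit coordinates are plenty for a 1000 x 1000 grid):

#include <cstdint>

// Spread the lower 16 bits of v so each bit lands in an even position.
uint32_t spread_bits(uint32_t v) {
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Interleave the bits of x and y into a single z-value (Morton order).
uint32_t morton_encode(uint32_t x, uint32_t y) {
    return spread_bits(x) | (spread_bits(y) << 1);
}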

Actually, the data structure is not trivial, especially where optimizations are concerned.
There are two main issues to resolve: data content and data usage. Data content means the values in the data; data usage means how the data is stored and retrieved, and how often.
Data Content
Are all the values accessed? Frequently?
Data that is not accessed frequently can be pushed to slower media, including files. Leave the fast memory (such as data caches) for the frequently accessed data.
Is the data similar? Are there patterns?
There are alternative methods for representing matrices where a lot of the data is the same (such as a sparse matrix or a lower triangular matrix). For large matrices, performing some checks and returning constant values may be faster or more efficient.
Data Usage
Data usage is a key factor in determining an efficient structure for the data. Even with matrices.
For example, for frequently accessed data, a map or associative array may be faster.
Sometimes, using many local variables (i.e. registers) may be more efficient for processing matrix data. For example, load up registers with values first (data fetches), operate using the registers, then store the registers back into memory. For most processors, registers are the fastest media for holding data.
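As an illustration, a sketch of that load/compute/store pattern (a hypothetical routine; the local variables give the compiler a chance to keep the values in registers):

// Load the three vertical neighbours into locals, compute, store once.
void smooth_column(int* m, int w, int x, int y) {
    int up   = m[(y - 1) * w + x]; // data fetches into locals
    int mid  = m[y * w + x];
    int down = m[(y + 1) * w + x];
    m[y * w + x] = (up + mid + down) / 3; // single store back to memory
}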
The data may need to be rearranged to make efficient use of data caches and cache lines. The data cache is a high-speed area of memory very close to the processor core. A cache line is one row of data in the data cache. An efficient matrix can fit one or more rows into a cache line.
The most efficient method is to perform as many accesses to a loaded cache line as possible, and to reduce the need to reload the data cache (for example because an index went out of range).
Can the operations be performed independently?
For example, scaling a matrix, where each location is multiplied by a value. These operations don't depend on other cells of the matrix. This allows the operations to be performed in parallel. If they can be performed in parallel, then they can be delegated to processors with multiple cores (such as GPUs).
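For instance, a minimal sketch of such an independent, cell-wise operation, parallelized with OpenMP (one option among many; the function and names are illustrative):

#include <vector>

void scale_matrix(std::vector<int>& m, int factor) {
    // Every cell is scaled independently of the others,
    // so the iterations can safely run in parallel.
    #pragma omp parallel for
    for (long long i = 0; i < (long long)m.size(); ++i)
        m[i] *= factor;
}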
Summary
When a program is data driven, the choice of data structures is not trivial. The content and usage are important factors when choosing a structure for the data and how the data is aligned. Usage and performance requirements will also determine the best algorithms for accessing the data. There are already many articles on the internet for optimizing for data driven applications and best usage of data caches.

Related

Sparse matrix storage format designed for row/col manipulation?

I am working with a program that needs to access and store data in a sparse matrix. About 40-60% of the matrix will be non-zero, dimensions are anywhere from 14K to 22K elements square.
Here's my problem: I'm going to be performing a lot of row and column manipulations, mainly adding, removing, and swapping. I've looked at most of the existing well-known sparse matrix formats (CRS, CCS, COO, the block formats, etc.), and most of them don't seem very receptive to these kinds of operations. The moment you start adding and removing entire rows or columns, you end up having to update all the elements to either side of the manipulated row/col, and that's something I'd like to avoid if possible. (It has occurred to me that you could probably manage elements in such a way that their coordinates in the matrix are actually stored as a pair of pointers to a common row or column index, saving yourself from manually updating thousands of elements by simply incrementing or decrementing that value.)
Is there anything like that out there?
I would consider a layer of indirection that serves as an efficient mechanism for swapping, inserting, and deleting rows and columns.
This would be analogous to how virtual memory is managed in modern operating systems. This is just a brief aside: the physical RAM addresses get mapped by the CPU's memory management unit into a linear addressing space. The MMU, with the host O/S's help, maps the actual physical RAM addresses into each process's virtual address space. The addresses used by pointers, and other objects, in each process are not real RAM addresses; they are virtual, and are translated by the hardware MMU into actual physical RAM addresses. This is why idle processes can be paged out by the host operating system into a swap file or partition when the system is short on space, and then loaded back in later, into some entirely different area of physical RAM, without the process even being aware of what happened.
Anyway, back on topic, this would be an analogous approach here.
Consider a plain, garden variety two-dimensional std::map. Here's your sparse matrix:
std::map<size_t, std::map<size_t, value_t>>
The first map's dimension, or key, is the row, which gives you a second map, whose dimension/key is the column, and which finally contains your value.
This is pretty straightforward. Nothing earth-shattering here. But as you understand: moving, swapping, and inserting rows and columns becomes quite a bear.
Ok, so let's introduce your own, personalized MMU: your map management unit. It will work, pretty much, like a hardware MMU:
std::map<size_t, size_t> rows;
std::map<size_t, size_t> columns;
So, in your example, to look up the value for virtual row R, column C:
1) rows[R] gives you the "physical" row number.
2) columns[C] gives you the "physical" column number.
3) Now, you take the "physical" row and column numbers, then go to your two-dimensional std::map, and look up your value given the physical row and column.
So, what do we get here? Well, moving an entire row or column now involves a simple update to your map management unit. Remove a value from the rows or the columns map, and put it back with a different key, the new "logical" row or column. The values in the "physical", two-dimensional map, remain untouched.
That's it. Moving, swapping, or inserting rows becomes a simple operation on your mmu object.
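A minimal sketch of the whole arrangement, assuming the maps shown above (value_t and the function names are illustrative):

#include <map>
#include <utility>

using value_t = double; // whatever your matrix stores

std::map<size_t, std::map<size_t, value_t>> cells; // "physical" storage
std::map<size_t, size_t> rows;    // logical row    -> physical row
std::map<size_t, size_t> columns; // logical column -> physical column

// Look up the value at logical row R, column C.
// (operator[] inserts missing entries; a real implementation would use find().)
value_t get(size_t R, size_t C) {
    return cells[rows[R]][columns[C]];
}

// Swapping two logical rows touches only the indirection map;
// the "physical" two-dimensional map remains untouched.
void swap_rows(size_t r1, size_t r2) {
    std::swap(rows[r1], rows[r2]);
}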
There are two other details that will need to be taken care of:
A) Keep track of which physical rows and columns are unused, so they can be assigned to new logical rows and columns when those get added.
B) Depending on your use case, it may also be necessary to map physical rows and columns back into the virtual row and column numbers.
Both of these are fairly trivial, straightforward tasks, which can be your homework assignment.

C++ - Performance of static arrays with variable size at launch

I wrote a cellular automaton program that stores data in a matrix (an array of arrays). For a 300*200 matrix I can achieve 60 or more iterations per second using static memory allocation (e.g. std::array).
I would like to produce matrices of different sizes without recompiling the program every time, i.e. the user enters a size and then the simulation for that matrix size begins. However, if I use dynamic memory allocation (e.g. std::vector), the simulation drops to ~2 iterations per second. How can I solve this problem? One option I've resorted to is pre-allocating a static array larger than anything I anticipate the user will select (e.g. 2000*2000), but this seems wasteful and still limits user choice to some degree.
I'm wondering if I can either
a) allocate memory once and then somehow "freeze" it for ordinary static array performance?
b) or perform more efficient operations on the std::vector? For reference, I am only performing matrix[x][y] == 1 and matrix[x][y] = 1 operations on the matrix.
According to this question/answer, there is no difference in performance between std::vector or using pointers.
EDIT:
I've rewritten the matrix, as per UmNyobe's suggestion, to be a single array accessed via matrix[y*size_x + x]. Using dynamic memory allocation (sized once at launch), I doubled the performance to ~5 iterations per second.
As per PaulMcKenzie's comment, I compiled a release build and got the performance I was looking for (60 or more iterations per second). However, this is the foundation for more, so I still wanted to quantify the benefit of one method over the other more thoroughly. I used std::chrono::high_resolution_clock to time each iteration, and found the performance difference between dynamic and static arrays (after switching to the single-array matrix representation) to be within the margin of error (450-600 microseconds per iteration).
The performance during debugging is a slight concern, however, so I think I'll keep both and switch to a static array when debugging.
For reference, I am only performing
matrix[x][y]
Red flag! Are you using vector<vector<int>> for your matrix
representation? This is a mistake, as the rows of your matrix will be
far apart in memory. You should use a single vector of size rows * cols
and index it with matrix[y * cols + x].
Furthermore, you should follow the approach where you index first by row and then by column, i.e. matrix[y][x] rather than matrix[x][y]. Your algorithm should also process the data in the same order. This is due to the fact that with matrix[y][x], the elements (x, y) and (x + 1, y) are adjacent in memory, while with any other layout the elements (x, y) and (x + 1, y) or (x, y + 1) are much farther apart.
Even if there is a performance decrease from std::array to std::vector (as the array can have its elements on the stack, which is faster), a decent algorithm will perform on the same order of magnitude using both collections.
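A minimal sketch of that single-array representation (names are illustrative; the vector is sized once at launch, as in the question's edit):

#include <vector>

int size_x, size_y;      // chosen by the user at launch
std::vector<int> matrix; // one contiguous allocation

void init(int sx, int sy) {
    size_x = sx;
    size_y = sy;
    matrix.assign(static_cast<size_t>(sx) * sy, 0); // allocate once
}

// Row-major access: (x, y) and (x + 1, y) are adjacent in memory.
int& cell(int x, int y) {
    return matrix[static_cast<size_t>(y) * size_x + x];
}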

Efficient (time and space complexity) data structure for dense and sparse matrix

I have to read a file in which is stored a matrix with cars (1=BlueCar, 2=RedCar, 0=Empty).
I need to write an algorithm to move the cars of the matrix in that way:
blue ones move downward;
red ones move rightward;
there is a turn in which all the blue ones move and a turn to move all the red ones.
Before the file is read I don't know the matrix size, nor whether it's dense or sparse, so I have to implement two data structures (one for dense and one for sparse) and two algorithms.
I need to reach the best time and space complexity possible.
Due to the unknown matrix size, I plan to store the data on the heap.
If the matrix is dense, I plan to use something like:
short int** M = new short int*[m];
short int* M_data = new short int[m*n];
for (int i = 0; i < m; ++i)
{
    M[i] = M_data + i * n;
}
With this structure I can allocate a contiguous block of memory, and it is also simple to access with M[i][j].
Now the problem is choosing the structure for the sparse case. I also have to consider how to move the cars through the algorithm in the simplest way: for example, when I evaluate a car, I need to easily find out whether the next position (downward or rightward) holds another car or is empty.
Initially I thought to define BlueCar and RedCar objects that inherit from a general Car object. In these objects I can save the matrix coordinates and then put them in:
std::vector<BlueCar> sparseBlu;
std::vector<RedCar> sparseRed;
Otherwise I can do something like:
vector< tuple< row, column, value >> sparseMatrix
But the problem of finding what's in the next position still remains.
Probably this is not the best way to do it, so how can I implement the sparse case in an efficient way (perhaps even using a single structure for the sparse case)?
Why not simply create a memory mapping directly over the file? (assuming your data 0,1,2 is stored in contiguous bytes (or bits) in the file, and the position of those bytes also represents the coordinates of the cars)
This way you don't need to allocate extra memory and read in all the data, and the data can simply and efficiently be accessed with M[i][j].
Going over the rows would be L1-cache friendly.
In case of very sparse data, you could scan through the data once and keep a list of the empty regions/blocks in memory (only need to store startpos and size), which you could then skip (and adjust where needed) in further runs.
With memory mapping, only frequently accessed pages are kept in memory. This means that once you have scanned for the empty regions, memory will only be allocated for the frequently accessed non-empty regions (all this will be done automagically by the kernel - no need to keep track of it yourself).
Another benefit is that you are accessing the OS disk cache directly. Thus no need to keep copying and moving data between kernel space and user space.
To further optimize space- and memory usage, the cars could be stored in 2 bits in the file.
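As an illustration, a minimal sketch of that memory-mapping idea on POSIX (assuming one byte per cell stored row-major in the file; error handling trimmed for brevity):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map the whole file and view it as an m x n byte matrix.
unsigned char* map_matrix(const char* path, size_t& length) {
    int fd = open(path, O_RDWR);
    struct stat st;
    fstat(fd, &st);
    length = static_cast<size_t>(st.st_size);
    void* p = mmap(nullptr, length, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd); // the mapping stays valid after the descriptor is closed
    return static_cast<unsigned char*>(p);
}

// Cell (i, j) of an m x n matrix is then simply data[i * n + j].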
Update:
I'll have to move cars with openMP and MPI... Will the memory mapping
work also with concurrent threads?
You could certainly use multithreading, but not sure if openMP would be the best solution here, because if you work on different parts of the data at the same time, you may need to check some overlapping regions (i.e. a car could move from one block to another).
Or you could let the threads work on the middle parts of the blocks, and then start other threads to do the boundaries (with red cars that would be one byte, with blue cars a full row).
You would also need a locking mechanism for adjusting the list of the sparse regions. I think the best way would be to launch separate threads (depending on the size of the data of course).
In a somewhat similar task, I simply made use of Compressed Row Storage.
The Compressed Row and Column (in the next section) Storage formats
are the most general: they make absolutely no assumptions about the
sparsity structure of the matrix, and they don't store any unnecessary
elements. On the other hand, they are not very efficient, needing an
indirect addressing step for every single scalar operation in a
matrix-vector product or preconditioner solve.
You will need to be a bit more specific about time and space complexity requirements. CSR requires an extra indexing step for simple operations, but that is a minor amount of overhead if you're just doing simple matrix operations.
There's already an existing C++ implementation available online as well.
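For reference, a minimal sketch of the CSR layout itself (the three usual arrays; this is the general convention, not tied to the linked implementation):

#include <vector>

struct CsrMatrix {
    int rows = 0, cols = 0;
    std::vector<double> val;     // non-zero values, stored row by row
    std::vector<int>    col_ind; // column index of each stored value
    std::vector<int>    row_ptr; // val[row_ptr[i]..row_ptr[i+1]) is row i
};

// Fetch element (i, j); returns 0.0 if it is not stored.
double csr_get(const CsrMatrix& A, int i, int j) {
    for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
        if (A.col_ind[k] == j) return A.val[k];
    return 0.0;
}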

How to change sub-matrix of a sparse matrix on CUDA device

I have a sparse matrix structure that I am using in conjunction with CUBLAS to implement a linear solver class. I anticipate that the dimensions of the sparse matrices I will be solving will be fairly large (on the order of 10^7 by 10^7).
I will also anticipate that the solver will need to be used many times and that a portion of this matrix will need be updated several times (between computing solutions) as well.
Copying the entire matrix structure from system memory to device memory could become quite a performance bottleneck, since only a fraction of the matrix entries will ever need to be changed at a given time.
What I would like to be able to do is to have a way to update only a particular sub-set / sub-matrix rather than recopy the entire matrix structure from system memory to device memory each time I need to change the matrix.
The matrix data structure would reside on the CUDA device in arrays:
d_col, d_row, and d_val
On the system side I would have corresponding arrays I, J, and val.
So ideally, I would only want to change the subsets of d_val that correspond to the values in the system array, val, that changed.
Note that I do not anticipate that any entries will be added to or removed from the matrix, only that existing entries will change in value.
Naively I would think that to implement this, I would have an integer array or vector on the host side, e.g. updateInds , that would track the indices of entries in val that have changed, but I'm not sure how to efficiently tell the CUDA device to update the corresponding values of d_val.
In essence: how do I change the entries of the CUDA device-side array (d_val) at indices updateInds[1], updateInds[2], ..., updateInds[n] to a new set of values val[updateInds[1]], val[updateInds[2]], ..., val[updateInds[n]], without recopying the entire val array from system memory into the CUDA device memory array d_val?
As long as you only want to change the numerical values of the value array associated with CSR (or CSC, or COO) sparse matrix representation, the process is not complicated.
Suppose I have code like this (excerpted from the CUDA conjugate gradient sample):
checkCudaErrors(cudaMalloc((void **)&d_val, nz*sizeof(float)));
...
cudaMemcpy(d_val, val, nz*sizeof(float), cudaMemcpyHostToDevice);
Now, subsequent to this point in the code, let's suppose I need to change some values in the d_val array, corresponding to changes I have made in val:
for (int i = 10; i < 25; i++)
    val[i] = 4.0f;
The process to move these particular changes is conceptually the same as if you were updating an array using memcpy, but we will use cudaMemcpy to update the d_val array on the device:
cudaMemcpy(d_val+10, val+10, 15*sizeof(float), cudaMemcpyHostToDevice);
Since these values were all contiguous, I can use a single cudaMemcpy call to effect the transfer.
If I have several disjoint regions similar to above, it will require several calls to cudaMemcpy, one for each region. If, by chance, the regions are equally spaced and of equal length:
for (int i = 10; i < 15; i++)
    val[i] = 1.0f;
for (int i = 20; i < 25; i++)
    val[i] = 2.0f;
for (int i = 30; i < 35; i++)
    val[i] = 4.0f;
then it would also be possible to perform this transfer using a single call to cudaMemcpy2D. The method is outlined here.
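For the three equally spaced regions above (5 floats every 10 elements, starting at index 10), that single call might look like the following sketch (pitches and widths are given in bytes):

cudaMemcpy2D(d_val + 10, 10 * sizeof(float), // dst, dst pitch between regions
             val + 10,   10 * sizeof(float), // src, src pitch between regions
             5 * sizeof(float),              // width: one region, in bytes
             3,                              // height: number of regions
             cudaMemcpyHostToDevice);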
Notes:
cudaMemcpy2D is slower than you might expect compared to a cudaMemcpy operation on the same number of elements.
CUDA API calls have some inherent overhead. If a large part of the matrix is to be updated in a scattered fashion, it may still be actually quicker to just transfer the whole d_val array, taking advantage of the fact that this can be done using a single cudaMemcpy operation.
The method described here cannot be used if non-zero values change their location in the sparse matrix. In that case, I cannot provide a general answer for how to surgically update a CSR sparse matrix on the device. And certain relatively simple changes could necessitate updating most of the array data (3 vectors) anyway.

What is the most efficient (yet sufficiently flexible) way to store multi-dimensional variable-length data?

I would like to know the best practice for efficiently storing (and subsequently accessing) sets of multi-dimensional data arrays of variable length. The focus is on performance, but I also need to be able to change the length of an individual data set during runtime without too much overhead.
Note: I know this is a somewhat lengthy question, but I have looked around quite a lot and could not find a solution or example which describes the problem at hand with sufficient accuracy.
Background
The context is a computational fluid dynamics (CFD) code that is based on the discontinuous Galerkin spectral element method (DGSEM) (cf. Kopriva (2009), Implementing Spectral Methods for Partial Differential Equations). For the sake of simplicity, let us assume a 2D data layout (it is in fact in three dimensions, but the extension from 2D to 3D should be straightforward).
I have a grid that consists of K square elements k (k = 0,...,K-1) that can be of different (physical) sizes. Within each grid element (or "cell") k, I have N_k^2 data points. N_k is the number of data points in each dimension, and can vary between different grid cells.
At each data point n_k,i (where i = 0,...,N_k^2-1) I have to store an array of solution values, which has the same length nVars in the whole domain (i.e. everywhere), and which does not change during runtime.
Dimensions and changes
The number of grid cells K is of O(10^5) to O(10^6) and can change during runtime.
The number of data points N_k in each grid cell is between 2 and 8 and can change during runtime (and may be different for different cells).
The number of variables nVars stored at each grid point is around 5 to 10 and cannot change during runtime (it is also the same for every grid cell).
Requirements
Performance is the key issue here. I need to be able to regularly iterate in an ordered fashion over all grid points of all cells in an efficient manner (i.e. without too many cache misses). Generally, K and N_k do not change very often during the simulation, so for example a large contiguous block of memory for all cells and data points could be an option.
However, I do need to be able to refine or coarsen the grid (i.e. delete cells and create new ones, the new ones may be appended to the end) during runtime. I also need to be able to change the approximation order N_k, so the number of data points I store for each cell can change during runtime as well.
Conclusion
Any input is appreciated. If you have experience yourself, or just know a few good resources that I could look at, please let me know. However, while the solution will be crucial to the performance of the final program, it is just one of many problems, so the solution needs to be of an applied nature and not purely academic.
Should this be the wrong venue to ask this question, please let me know what a more suitable place would be.
Often, these sorts of dynamic mesh structures can be very tricky to deal with efficiently, but in block-structured adaptive mesh refinement codes (common in astrophysics, where complex geometries aren't important) or your spectral element code where you have large block sizes, it is often much less of an issue. You have so much work to do per block/element (with at least 10^5 cells x 2 points/cell in your case) that the cost of switching between blocks is comparatively minor.
Keep in mind, too, that you can't generally do too much of the hard work on each element or block until a substantial amount of that block's data is already in cache. You're already going to have had to flush most of block N's data out of cache before getting much work done on block N+1's anyway. (There might be some operations in your code which are exceptions to this, but those are probably not the ones where you're spending much time anyway, cache or no cache, because there's not a lot of data reuse - e.g., elementwise operations on cell values.) So keeping the blocks/elements beside each other isn't necessarily a huge deal; on the other hand, you definitely want the blocks/elements to be themselves contiguous.
Also notice that you can move blocks around to keep them contiguous as things get resized, but not only are all those memory copies going to wipe your cache, the copies themselves get very expensive. If your problem is filling a significant fraction of memory (and aren't we always?), say 1 GB, and you have to move 20% of that around after a refinement to make things contiguous again, that's 0.2 GB (read + write) at ~20 GB/s, or about 20 ms, compared to reloading (say) 16k cache lines at ~100 ns each, or about 1.6 ms. And your cache is trashed after the shuffle anyway. This might still be worth doing if you knew that you were going to do the refinement/derefinement very seldom.
But as a practical matter, most adaptive mesh codes in astrophysical fluid dynamics (where I know the codes well enough to say) simply maintain a list of blocks and their metadata and don't worry about their contiguity. YMMV of course. My suggestion would be - before spending too much time crafting the perfect data structure - to first just test the operation on two elements, twice; the first, with the elements in order and computing on them 1-2, and the second, doing the operation in the "wrong" order, 2-1, and timing the two computations several times.
For each cell, store the offset at which to find the cell's data in a contiguous array. This offset mapping is very efficient and widely used. You can reorder the cells for cache reuse in traversals. When the order or number of cells changes, create a new array and interpolate into it, then throw away the old arrays. This storage is much better for external analysis because operations like inner products in Krylov methods and stages in Runge-Kutta methods can be managed without reference to the mesh. It also requires minimal memory per vector (e.g. in Krylov bases and with time integration).
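A minimal sketch of that offset mapping (names are illustrative; nVars values per point and N_k^2 points per cell, as defined in the question):

#include <vector>

struct Mesh {
    std::vector<double> data;   // all solution values, one contiguous block
    std::vector<size_t> offset; // offset[k]: start of cell k's data
    std::vector<int>    N;      // N[k]: points per dimension in cell k
    int nVars = 5;              // fixed during runtime

    // Pointer to the nVars solution values at point i of cell k.
    double* point(size_t k, int i) {
        return &data[offset[k] + static_cast<size_t>(i) * nVars];
    }
};

// When cells are refined or coarsened: build new data/offset arrays,
// interpolate the old values into them, then discard the old arrays.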