How to change sub-matrix of a sparse matrix on CUDA device - c++

I have a sparse matrix structure that I am using in conjunction with CUBLAS to implement a linear solver class. I anticipate that the dimensions of the sparse matrices I will be solving will be fairly large (on the order of 10^7 by 10^7).
I also anticipate that the solver will need to be used many times, and that a portion of this matrix will need to be updated several times (between computing solutions) as well.
Copying the entire matrix structure from system memory to device memory could become quite a performance bottleneck, since only a fraction of the matrix entries will ever need to be changed at a given time.
What I would like to be able to do is to have a way to update only a particular sub-set / sub-matrix rather than recopy the entire matrix structure from system memory to device memory each time I need to change the matrix.
The matrix data structure would reside on the CUDA device in arrays:
d_col, d_row, and d_val
On the system side I would have corresponding arrays I, J, and val.
So ideally, I would only want to change the subsets of d_val that correspond to the values in the system array, val, that changed.
Note that I do not anticipate that any entries will be added to or removed from the matrix, only that existing entries will change in value.
Naively I would think that to implement this, I would have an integer array or vector on the host side, e.g. updateInds, that would track the indices of entries in val that have changed, but I'm not sure how to efficiently tell the CUDA device to update the corresponding values of d_val.
In essence: how do I change the entries in a CUDA device side array (d_val) at indices updateInds[1], updateInds[2], ..., updateInds[n] to a new set of values val[updateInds[1]], val[updateInds[2]], ..., val[updateInds[n]], without recopying the entire val array from system memory into the CUDA device memory array d_val?

As long as you only want to change the numerical values of the value array associated with CSR (or CSC, or COO) sparse matrix representation, the process is not complicated.
Suppose I have code like this (excerpted from the CUDA conjugate gradient sample):
checkCudaErrors(cudaMalloc((void **)&d_val, nz*sizeof(float)));
...
cudaMemcpy(d_val, val, nz*sizeof(float), cudaMemcpyHostToDevice);
Now, subsequent to this point in the code, let's suppose I need to change some values in the d_val array, corresponding to changes I have made in val:
for (int i = 10; i < 25; i++)
    val[i] = 4.0f;
The process to move these particular changes is conceptually the same as if you were updating an array using memcpy, but we will use cudaMemcpy to update the d_val array on the device:
cudaMemcpy(d_val+10, val+10, 15*sizeof(float), cudaMemcpyHostToDevice);
Since these values were all contiguous, I can use a single cudaMemcpy call to effect the transfer.
If I have several disjoint regions similar to above, it will require several calls to cudaMemcpy, one for each region. If, by chance, the regions are equally spaced and of equal length:
for (int i = 10; i < 15; i++)
    val[i] = 1.0f;
for (int i = 20; i < 25; i++)
    val[i] = 2.0f;
for (int i = 30; i < 35; i++)
    val[i] = 4.0f;
then it would also be possible to perform this transfer using a single call to cudaMemcpy2D. The method is outlined here.
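For the three equally spaced, equal-length regions above, a single call could look like this (a sketch based on the spacing of 10 elements and length of 5 elements shown above):
// one "row" per region: 3 regions of 5 floats, one starting every 10 elements
cudaMemcpy2D(d_val + 10, 10 * sizeof(float),   // destination start and pitch in bytes
             val + 10,   10 * sizeof(float),   // source start and pitch in bytes
             5 * sizeof(float),                // width of each region in bytes
             3,                                // number of regions ("rows")
             cudaMemcpyHostToDevice);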
Notes:
cudaMemcpy2D is slower than you might expect compared to a cudaMemcpy operation on the same number of elements.
CUDA API calls have some inherent overhead. If a large part of the matrix is to be updated in a scattered fashion, it may actually still be quicker to just transfer the whole d_val array, taking advantage of the fact that this can be done with a single cudaMemcpy operation (for small, scattered sets of changes, see the scatter-kernel sketch after these notes).
The method described here cannot be used if non-zero values change their location in the sparse matrix. In that case, I cannot provide a general answer for how to surgically update a CSR sparse matrix on the device. And certain relatively simple changes could necessitate updating most of the array data (3 vectors) anyway.
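If only a small, scattered subset of values changes, one further alternative (a sketch, not part of the answer above; d_inds, d_newVals and the host-side gathering are assumptions) is to copy just the index list and the new values to the device and scatter them with a small kernel:
__global__ void scatter_update(float *d_val, const int *d_inds, const float *d_newVals, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_val[d_inds[i]] = d_newVals[i];   // write each changed value into its slot of d_val
}

// host side (sketch): copy the n indices and n gathered values, then launch
// cudaMemcpy(d_inds, updateInds, n * sizeof(int), cudaMemcpyHostToDevice);
// cudaMemcpy(d_newVals, newVals, n * sizeof(float), cudaMemcpyHostToDevice);
// scatter_update<<<(n + 255) / 256, 256>>>(d_val, d_inds, d_newVals, n);
Per the note on API overhead above, this only pays off when n is much smaller than nz.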

Related

Combine two separate buffers into a complex one

I'm working on a project involving real-time RX and TX transmission using a Software Defined Radio.
I have to pass the SDR transmission API a complex float array buffer to be sent. I'm trying to implement every feature avoiding "for loops", since working one element at a time slows down execution, and I need to do a lot of upsampling, FIR filtering and other computationally intensive work.
Now I am facing a problem. Suppose I have two separate buffers, one representing the real part and the other the imaginary part of the complex sample buffer I have to pass to the API's tx function.
Say the real buffer is RRRRRRRRRRRR while the imag buffer is IIIIIIIIIIII. The example is for 12 samples, but really it could be 2048, 4096 or more...
int size = 12;
float *reals,*imags;
reals = new float[size];
imags = new float[size];
Now I need an output that is defined as
complex<float> *cplxOut;
cplxOut = new complex<float>[size];
In memory this object is stored as RIRIRIRIRIRIRIRIRIRIRIRI.
Building cplxOut from the two real and imag buffers is easy using a for loop:
for (int i = 0; i < size; i++)
{
    (*(cplxOut+i)).real(*(reals+i));
    (*(cplxOut+i)).imag(*(imags+i));
}
I wonder if there is a quicker way to do it using direct memory-move functions on whole buffers.
I tried to use inline assembly to speed up the task, but it has portability problems across architectures and is not supported for x64 on the Windows side.
A possible way could be to upsample by two, interleaving with zeros, shift the imag buffer forward by one place and then OR the two buffers, but to do the upsampling I would have to use a for loop as well... so no way.
Do you have any suggestions? I need the fastest way to do it.
Tnx, Fabio
You don't have to construct an array of empty complex objects; use std::vector and emplace_back() instead:
vector<complex<float>> cplxOut;
// to avoid reallocations when adding new elements
cplxOut.reserve(size);
for (int i = 0; i < size; i++)
{
    // create in-place complex number
    cplxOut.emplace_back(reals[i], imags[i]);
}
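Another option, relying on the guarantee (since C++11) that an array of std::complex<float> may be reinterpreted as an array of float with real and imaginary parts interleaved, is to write straight into the complex array's storage. A minimal sketch, assuming size, reals and imags as in the question:
#include <complex>
#include <vector>

std::vector<std::complex<float>> cplxOut(size);
// [complex.numbers]: element 2*i is the real part, element 2*i+1 the imaginary part
float *out = reinterpret_cast<float *>(cplxOut.data());
for (int i = 0; i < size; ++i) {
    out[2 * i]     = reals[i];   // produces R I R I R I ... in memory
    out[2 * i + 1] = imags[i];
}
The loop is still a for loop, but it is a pure streaming interleave that compilers typically auto-vectorize; profiling would have to confirm whether it beats emplace_back on a given platform.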

Why do two array lookup loops have different speeds [duplicate]

This question already has answers here:
Accessing elements of a matrix row-wise versus column-wise
(3 answers)
Closed 4 years ago.
I have an array, long matrix[8*1024][8*1024], and two functions sum1 and sum2:
long sum1(long m[ROWS][COLS]) {
    long register sum = 0;
    int i, j;
    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLS; j++) {
            sum += m[i][j];
        }
    }
    return sum;
}
long sum2(long m[ROWS][COLS]) {
    long register sum = 0;
    int i, j;
    for (j = 0; j < COLS; j++) {
        for (i = 0; i < ROWS; i++) {
            sum += m[i][j];
        }
    }
    return sum;
}
When I execute the two functions with the given array, I get running times:
sum1: 0.19s
sum2: 1.25s
Can anyone explain why there is this huge difference?
C uses row-major ordering to store multidimensional arrays, as documented in § 6.5.2.1 Array subscripting, paragraph 3 of the C Standard:
Successive subscript operators designate an element of a multidimensional array object. If E is an n-dimensional array (n >= 2) with dimensions i x j x . . . x k, then E (used as other than an lvalue) is converted to a pointer to an (n - 1)-dimensional array with dimensions j x . . . x k. If the unary * operator is applied to this pointer explicitly, or implicitly as a result of subscripting, the result is the referenced (n - 1)-dimensional array, which itself is converted into a pointer if used as other than an lvalue. It follows from this that arrays are stored in row-major order (last subscript varies fastest).
Emphasis mine.
There's an image on Wikipedia that demonstrates this storage technique compared to the other method for storing multidimensional arrays, column-major ordering.
The first function, sum1, accesses data consecutively per how the 2D array is actually represented in memory, so the data from the array is already in the cache. sum2 requires fetching of another row on each iteration, which is less likely to be in the cache.
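To make the stride difference concrete, here is a small check of the address arithmetic (a minimal sketch, not from the original answer, using the dimensions from the question):
#include <stdio.h>
#define ROWS (8*1024)
#define COLS (8*1024)
static long m[ROWS][COLS];

int main(void) {
    /* row-major layout: the last subscript varies fastest in memory */
    printf("m[i][j] -> m[i][j+1]: %zu bytes apart\n",
           (size_t)((char *)&m[0][1] - (char *)&m[0][0]));  /* sizeof(long), e.g. 8 */
    printf("m[i][j] -> m[i+1][j]: %zu bytes apart\n",
           (size_t)((char *)&m[1][0] - (char *)&m[0][0]));  /* COLS*sizeof(long) = 64 KiB */
    return 0;
}
So sum1 advances a few bytes per access, while sum2 jumps an entire 64 KiB row per access.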
There are some other languages that use column-major ordering for multidimensional arrays; among them are R, FORTRAN and MATLAB. If you wrote equivalent code in these languages, you would instead observe faster execution with sum2.
Computers generally use cache to help speed up access to main memory.
The hardware usually used for main memory is relatively slow; it can take many processor cycles for data to come from main memory to the processor. So a computer generally includes a smaller amount of very fast but expensive memory called cache. Computers may have several levels of cache, some of it built into the processor or the processor chip itself and some of it located outside the processor chip.
Since the cache is smaller, it cannot hold everything in main memory. It often cannot even hold everything that one program is using. So the processor has to make decisions about what is kept in cache.
The most frequent accesses of a program are to consecutive locations in memory. Very often, after a program reads element 237 of an array, it will soon read 238, then 239, and so on. It is less often that it reads 7024 just after reading 237.
So the operation of cache is designed to keep portions of main memory that are consecutive in cache. Your sum1 program works well with this because it changes the column index most rapidly, keeping the row index constant while all the columns are processed. The array elements it accesses are laid out consecutively in memory.
Your sum2 program does not work well with this because it changes the row index most rapidly. This skips around in memory, so many of the accesses it makes are not satisfied by cache and have to come from slower main memory.
Related Resource: Memory layout of multi-dimensional arrays
On a machine with data cache (even a 68030 has one), reading/writing data in consecutive memory locations is way faster, because a block of memory (size depends on the processor) is fetched once from memory and then recalled from the cache (read operation) or written all at once (cache flush for write operation).
By "skipping" data (reading far from the previous read), the CPU has to read the memory again.
That's why your first snippet is faster.
For more complex operations (fast Fourier transform, for instance), where data is read more than once (unlike your example), a lot of libraries (FFTW, for instance) offer a stride parameter to accommodate your data organization (in rows/in columns). Never use it: always transpose your data first and use a stride of 1; it will be faster than trying to work around the layout without transposition.
To make sure your data is consecutive, never use 2D notation. First position your data in the selected row and set a pointer to the start of the row, then use an inner loop on that row.
for (i = 0; i < ROWS; i++) {
    const long *row = m[i];
    for (j = 0; j < COLS; j++) {
        sum += row[j];
    }
}
If you cannot do this, that means that your data is wrongly oriented.
This is an issue with the cache.
The cache will automatically read data that lies after the data you requested. So if you read the data row by row, the next data you request will already be in the cache.
A matrix is laid out linearly in memory, such that the items in a row sit next to each other (spatial locality). When you traverse the items in order, going through all of the columns in a row before moving on to the next row, then whenever the CPU reaches an entry that isn't loaded into its cache yet, it loads that value along with a whole block of other values close to it in physical memory, so the next several values are already cached by the time it needs to read them.
When you traverse them the other way, the other values loaded alongside each entry are not the next ones read, so you wind up with a lot more cache misses, and the CPU has to sit and wait while the data is brought in from the next layer of the memory hierarchy.
By the time you swing back around to an entry that you had previously cached, it has more than likely been evicted from the cache in favor of all the other data you've since loaded, since it has not been used recently (temporal locality).
To expand on the other answers that this is due to cache-misses for the second program, and assuming that you are using Linux, *BSD, or MacOS, then Cachegrind may give you enlightenment. It's part of valgrind, and will run your program, without changes, and print the cache usage statistics. It does run very slowly though.
http://valgrind.org/docs/manual/cg-manual.html
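Typical usage is valgrind --tool=cachegrind ./your_program (the program name here is just a placeholder); it reports the instruction- and data-cache miss totals, and cg_annotate can then attribute the misses to individual source lines.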

Cache-Friendly Matrix, for access to adjacent cells

Current Design
In my program I have a big 2-D grid (1000 x 1000, or more); each cell contains a small piece of information.
In order to represent this concept the choice is quite trivial: a matrix data structure.
The correspondent code (in C++) is something like:
int w_size_grid = 1000;
int h_size_grid = 1000;
int* matrix = new int[w_size_grid * h_size_grid];
As you can see I've used a flat one-dimensional array rather than a nested structure, but the principle is the same.
In order to access an element of the grid, we need a function that, given a cell in the grid identified by (x,y), returns the value stored in that cell.
Mathematically:
f: Z^2 -> Z
where Z is the set of integers.
That can be trivially achieved with a linear function. Here is a C++ representation:
int get_value(int x, int y) {
    return matrix[y*w_size_grid + x];
}
Additional Implementation Notes
Actually the design concept requires a sort of "circular, continuous grid": the access indices for a cell can go outside the limits of the grid itself.
That means, for example, the particular case: get_value(-1, -1); is still valid. The function will just return the same value as get_value(w_size_grid - 1, h_size_grid -1);.
Actually this is not a problem in the implementation:
int get_value(int x, int y) {
    adjust_xy(&x, &y); // modify x and y in accordance with that rule
    return matrix[y*w_size_grid + x];
}
Anyway this is just an additional note in order to make the scenario more clear.
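For completeness, one plausible implementation of adjust_xy with the wrap-around behaviour described above (an assumption, since the original post doesn't show it):
void adjust_xy(int *x, int *y) {
    // wrap negative and out-of-range indices back into [0, size): "circular" grid
    *x = ((*x % w_size_grid) + w_size_grid) % w_size_grid;
    *y = ((*y % h_size_grid) + h_size_grid) % h_size_grid;
}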
What is the problem?
The problem presented above is very trivial and simple to design and to implement.
My problem comes from the fact that the matrix is updated at a high frequency. Each cell in the matrix is read and possibly updated with a new value.
Obviously the matrix is traversed with two loops, in keeping with a cache-friendly design:
for (int y = 0; y < h_size_grid; ++y) {
    for (int x = 0; x < w_size_grid; ++x) {
        int value = get_value(x, y);
    }
}
The inner loop is over x, since [x-1], [x], [x+1] are stored contiguously. Indeed, that loop exploits the principle of locality.
The problem arises because the new value of a cell depends on the values in the adjacent cells.
Each cell has exactly eight neighbours, which are the cells that are horizontally, vertically, or diagonally adjacent.
(-1,-1) | (0,-1) | (1,-1)
-------------------------
(-1,0) | (0,0) | (1,0)
-------------------------
(-1,1) | (0,1) | (1,1)
So the code is intuitively:
for (int y = 0; y < h_size_grid; ++y) {
    for (int x = 0; x < w_size_grid; ++x) {
        int value = get_value(x, y);
        auto values = get_value_all_neighbours(x, y); // values are 8 integers
    }
}
The function get_value_all_neighbours(x,y) will access one row up and one row down in the matrix relative to y.
Since a row of the matrix is quite big, those accesses cause cache misses and pollute the cache.
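For reference, a plausible shape of that helper (again an assumption; it is not shown in the original post):
#include <array>

std::array<int, 8> get_value_all_neighbours(int x, int y) {
    std::array<int, 8> v;
    int k = 0;
    for (int dy = -1; dy <= 1; ++dy)        // touches rows y-1, y and y+1
        for (int dx = -1; dx <= 1; ++dx)
            if (dx != 0 || dy != 0)
                v[k++] = get_value(x + dx, y + dy);
    return v;
}
Rows y-1 and y+1 each lie w_size_grid elements away from row y, which is why every call touches three widely separated regions of memory.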
The Question
Now that I have finally presented the scenario and the problem, my question is how to "solve" it.
Using some additional data structure, or reorganizing the data, is there a way to exploit the caches and avoid all those misses?
Some Personal Consideration
My feelings guide me toward a strategic data structure.
I've thought about changing the order in which the values are stored in the array, trying to store neighbouring cells at contiguous indices.
That implies that get_value would no longer be a linear function.
After some thinking, I believe it is not possible to find such a non-linear function.
I've also thought about additional data structures, like a hash table, to store the adjacent values for each cell, but I think that is overkill in space and probably in CPU cycles as well.
Let's assume you do indeed have a problem with cache misses that can't easily be avoided (referring to the other answers here).
You could use a space filling curve to organize your data in a cache friendly way. Essentially, space filling curves map a volume or plane (such as your matrix) to a linear representation, such that values that are close together in space (mostly) end up close together in the linear representation. In effect, if you store the matrix in a z-ordered array, neighbouring elements have a high likelihood of being on the same memory page.
The best proximity mapping is available with the Hilbert Curve, but it is expensive to calculate. A better option may be a z-curve (Morton-Order). It provides good proximity and is fast to calculate.
Z-Curve: Essentially, to get the ordering, you have to interleave the bits of your x and y coordinate into a single value, called 'z-value'. This z-value determines the position in your list of matrix values (or even simply the index in an array if you use an array). The z-values are consecutive for a completely filled matrix (every cell is used). Conversely, you can de-interleave the position in the list (=array index) and get your x/y coordinates back.
Interleaving bits of two values is quite fast, there are even dedicated CPU instructions to do this with few cycles. If you can't find these (I can't, at the moment), you can simply use some bit twiddling tricks.
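On x86 the dedicated instruction is PDEP from the BMI2 extension (e.g. _pdep_u32(x, 0x55555555)); a portable bit-twiddling sketch of the interleave looks like this:
#include <cstdint>

// spread the lower 16 bits of n so that they occupy the even bit positions
static uint32_t part1by1(uint32_t n) {
    n &= 0x0000ffff;
    n = (n | (n << 8)) & 0x00ff00ff;
    n = (n | (n << 4)) & 0x0f0f0f0f;
    n = (n | (n << 2)) & 0x33333333;
    n = (n | (n << 1)) & 0x55555555;
    return n;
}

// z-value (Morton code) of a cell: x bits in the even positions, y bits in the odd ones
uint32_t morton2d(uint32_t x, uint32_t y) {
    return part1by1(x) | (part1by1(y) << 1);
}
morton2d(x, y) can then be used directly as the index into the value array, provided the grid dimensions are (padded to) powers of two.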
Actually, the data structure is not trivial, especially when optimizations are concerned.
There are two main issues to resolve: data content and data usage. Data content is the values in the data; data usage is how the data is stored and retrieved, and how often.
Data Content
Are all the values accessed? Frequently?
Data that is not accessed frequently can be pushed to slower media, including files. Leave the fast memory (such as data caches) for the frequently accessed data.
Is the data similar? Are there patterns?
There are alternative methods for representing matrices where a lot of the data is the same (such as a sparse matrix or a lower triangular matrix). For large matrices, maybe performing some checks and returning constant values may be faster or more efficient.
Data Usage
Data usage is a key factor in determining an efficient structure for the data. Even with matrices.
For example, for frequently accessed data, a map or associative array may be faster.
Sometimes, using many local variables (i.e. registers) may be more efficient for processing matrix data. For example, load up registers with values first (data fetches), operate using the registers, then store the registers back into memory. For most processors, registers are the fastest media for holding data.
The data may want to be rearranged to make efficient use of data caches and cache lines. The data cache is a high speed area of memory very close to the processor core. A cache line is one row of data in the data cache. An efficient matrix can fit one or more rows per cache line.
The most efficient method is to perform as many accesses to a data cache line as possible before it is evicted. Prefer to reduce the need to reload the data cache (because an access fell outside the cached range).
Can the operations be performed independently?
For example, scaling a matrix, where each location is multiplied by a value. These operations don't depend on other cells of the matrix. This allows the operations to be performed in parallel. If they can be performed in parallel, then they can be delegated to processors with multiple cores (such as GPUs).
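A tiny illustration of such an independent, cell-wise operation (a sketch; matrix, rows, cols and scale are placeholders), which could for instance be parallelised with OpenMP:
#pragma omp parallel for
for (int i = 0; i < rows * cols; ++i)
    matrix[i] *= scale;   // each cell is updated independently of every other cell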
Summary
When a program is data driven, the choice of data structures is not trivial. The content and usage are important factors when choosing a structure for the data and how the data is aligned. Usage and performance requirements will also determine the best algorithms for accessing the data. There are already many articles on the internet for optimizing for data driven applications and best usage of data caches.

Efficient (time and space complexity) data structure for dense and sparse matrix

I have to read a file in which is stored a matrix with cars (1=BlueCar, 2=RedCar, 0=Empty).
I need to write an algorithm to move the cars of the matrix in that way:
blue ones move downward;
red ones move rightward;
there is a turn in which all the blue ones move and a turn to move all the red ones.
Before the file is read I don't know the matrix size and if it's dense or sparse, so I have to implement two data structures (one for dense and one for sparse) and two algorithms.
I need to reach the best time and space complexity possible.
Due to the unknown matrix size, I think I will store the data on the heap.
If the matrix is dense, I think I will use something like:
short int** M = new short int*[m];
short int* M_data = new short int[m*n];
for (int i = 0; i < m; ++i)
{
    M[i] = M_data + i * n;
}
With this structure I can allocate a contiguous space of memory and it is also simple to be accessed with M[i][j].
Now the problem is which structure to choose for the sparse case, and I also have to consider how to move the cars through the algorithm in the simplest way: for example, when I evaluate a car, I need to easily find out whether the next position (downward or rightward) holds another car or is empty.
Initially I thought of defining BlueCar and RedCar objects that inherit from a general Car object. In these objects I can save the matrix coordinates and then put them in:
std::vector<BlueCar> sparseBlu;
std::vector<RedCar> sparseRed;
Otherwise I can do something like:
vector< tuple< row, column, value >> sparseMatrix
But the problem of finding what's in the next position still remains.
Probably this is not the best way to do it, so how can I implement the sparse case in an efficient way? (ideally also using a single structure for the sparse case)
Why not simply create a memory mapping directly over the file? (assuming your data 0,1,2 is stored in contiguous bytes (or bits) in the file, and the position of those bytes also represents the coordinates of the cars)
This way you don't need to allocate extra memory and read in all the data, and the data can simply and efficiently be accessed with M[i][j].
Going over the rows would be L1-cache friendly.
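A minimal POSIX sketch of that idea (assuming one byte per cell, row-major storage in the file, and known rows and cols; the file name and variable names are placeholders):
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int fd = open("cars.dat", O_RDWR);
size_t bytes = (size_t)rows * cols;
unsigned char *M = (unsigned char *)mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                                         MAP_SHARED, fd, 0);
unsigned char cell = M[(size_t)i * cols + j];   // cell (i, j) of the grid
// ... move cars in place; changes are written back to the file by the kernel ...
munmap(M, bytes);
close(fd);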
In case of very sparse data, you could scan through the data once and keep a list of the empty regions/blocks in memory (only need to store startpos and size), which you could then skip (and adjust where needed) in further runs.
With memory mapping, only frequently accessed pages are kept in memory. This means that once you have scanned for the empty regions, memory will only be allocated for the frequently accessed non-empty regions (all this will be done automagically by the kernel - no need to keep track of it yourself).
Another benefit is that you are accessing the OS disk cache directly. Thus no need to keep copying and moving data between kernel space and user space.
To further optimize space and memory usage, the cars could be stored in 2 bits each in the file.
Update:
I'll have to move cars with openMP and MPI... Will the memory mapping
work also with concurrent threads?
You could certainly use multithreading, but not sure if openMP would be the best solution here, because if you work on different parts of the data at the same time, you may need to check some overlapping regions (i.e. a car could move from one block to another).
Or you could let the threads work on the middle parts of the blocks, and then start other threads to do the boundaries (with red cars that would be one byte, with blue cars a full row).
You would also need a locking mechanism for adjusting the list of the sparse regions. I think the best way would be to launch separate threads (depending on the size of the data of course).
In a somewhat similar task, I simply made use of Compressed Row Storage.
The Compressed Row and Column (in the next section) Storage formats
are the most general: they make absolutely no assumptions about the
sparsity structure of the matrix, and they don't store any unnecessary
elements. On the other hand, they are not very efficient, needing an
indirect addressing step for every single scalar operation in a
matrix-vector product or preconditioner solve.
You will need to be a bit more specific about time and space complexity requirements. CSR requires an extra indexing step for simple operations, but that is a minor amount of overhead if you're just doing simple matrix operations.
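For reference, a minimal CSR layout looks roughly like this (a sketch; field names are illustrative):
#include <vector>

struct CSRMatrix {
    int rows = 0, cols = 0;
    std::vector<short> val;    // stored (non-empty) cell values, row by row
    std::vector<int>   colInd; // column index of each stored value
    std::vector<int>   rowPtr; // rows + 1 offsets; row i occupies [rowPtr[i], rowPtr[i+1])
};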
There's already an existing C++ implementation available online as well.

Filling counting 'buckets' in CUDA threads

In my program, I'm tracking a large number of particles through a voxel grid. The ratio of particles to voxels is arbitrary. At a certain point, I need to know which particles lie in which voxels, and how many do. Specifically, the voxels must know exactly which particles are contained within them. Since I can't use anything like std::vector in CUDA, I'm using the following algorithm (at the high level):
Allocate an array of ints the size of the number of voxels
Launch threads for the all the particles, determine the voxel each one lies in, and increase the appropriate counter in my 'bucket' array
Allocate an array of pointers the size of the number of particles
Calculate each voxel's offset into this new array (summing the number of particles in the voxels preceding it)
Place the particles in the array in an ordered fashion (I use this data to accelerate an operation later on. The speed increase is well worth the increased memory usage).
This breaks down on the second step though. I haven't been programming in CUDA for long, and just found out that simultaneous writes among threads to the same location in global memory produce undefined results. This is reflected in the fact that I mostly get 1's in buckets, with the occasional 2. Here's a sketch of the code I'm using for this step:
__global__ void GPU_AssignParticles(Particle* particles, Voxel* voxels, int* buckets) {
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    if(tid < num_particles) { // <-- you can assume I actually passed this to the function :)
        // Some math to determine the index of the voxel which this particle
        // resides in.
        buckets[index] += 1;
    }
}
My question is, what's the proper way to generate these counts in CUDA?
Also, is there a way to store references to the particles within the voxels? The issue I see there is that the number of particles within a voxel constantly changes, so new arrays would have to be deallocated and reallocated almost every frame.
Although there may be more efficient solutions for calculating the bucket counts, a first working solution is to use your current approach, but with an atomic increment. This way only one thread at a time updates a given bucket count; the increment is performed atomically with respect to every other thread in the grid:
if(tid < num_particles) {
    // ...
    atomicAdd(&buckets[index], 1);
}
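As for placing the particles afterwards (steps 3-5 in the question), the usual pattern is an exclusive prefix sum over buckets to get each voxel's starting offset, followed by a second pass that scatters particle indices using per-voxel write cursors. A sketch (the array names are assumptions, not from the answer above):
__global__ void GPU_ScatterParticles(const int* voxel_of,   // voxel index of each particle
                                     const int* offsets,    // exclusive prefix sum of the bucket counts
                                     int* cursors,          // per-voxel counters, zero-initialised
                                     int* particle_ids,     // output: particle indices grouped by voxel
                                     int num_particles) {
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    if(tid < num_particles) {
        int v = voxel_of[tid];
        int slot = offsets[v] + atomicAdd(&cursors[v], 1);
        particle_ids[slot] = tid;
    }
}
This also addresses the second question: voxel v's particles are particle_ids[offsets[v]] through particle_ids[offsets[v] + buckets[v] - 1], so no per-voxel arrays need to be reallocated each frame.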