CUDA matrix inversion by referencing CUDA-pointer - c++

Currently I'm just trying to implement a simple linear regression algorithm in matrix form based on cuBLAS with CUDA. Matrix multiplication and transposition work well with the cublasSgemm function.
Problems begin with matrix inversions, based on the cublas<t>getrfBatched() and cublas<t>getriBatched() functions (see here).
As can be seen, the input parameters of these functions are arrays of pointers to matrices. Imagine that I've already allocated memory for the (A^T * A) matrix on the GPU as the result of previous calculations:
float* dProdATA;
cudaStat = cudaMalloc((void **)&dProdATA, n*n*sizeof(*dProdATA));
Is it possible to run factorization (inversion)
cublasSgetrfBatched(handle, n, &dProdATA, lda, P, INFO, mybatch);
without additional host <-> GPU memory copying (see the working example of inverting an array of matrices) and without allocating an array with a single element, but instead just getting a GPU reference to the GPU pointer?

There is no way around the requirement that the array you pass be in the device address space, and what you posted in your question won't work. You really only have two possibilities:
Allocate an array of pointers on the device and do the memory transfer (the solution you don't want to use).
Use zero-copy or managed host memory to store the batch array
In the latter case with managed memory, something like this should work (completely untested, use at own risk):
float ** batch;
cudaMallocManaged(&batch, sizeof(float *));
*batch = dProdATA;
cublasSgetrfBatched(handle, n, batch, lda, P, INFO, mybatch);
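For reference, a minimal sketch of the first option (a device-side array of pointers plus an explicit copy; the names d_batch and h_batch are illustrative) might look like this; the extra cudaMemcpy is exactly the host <-> GPU transfer the question hoped to avoid:
float **d_batch;
float *h_batch[1] = { dProdATA };   // host array holding the device pointer value
cudaMalloc((void **)&d_batch, sizeof(float *));
cudaMemcpy(d_batch, h_batch, sizeof(float *), cudaMemcpyHostToDevice);
cublasSgetrfBatched(handle, n, d_batch, lda, P, INFO, mybatch);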

Related

Which one is faster? raw pointers vs thrust vectors

I am a beginner in CUDA, and I just wanted to ask a simple question that I could not find a clear answer for.
I know that we can define our array in Device memory using a raw pointer:
int *raw_ptr;
cudaMalloc((void **) &raw_ptr, N * sizeof(int));
And, we can also use Thrust to define a vector and push_back our items:
thrust::device_vector<int> D;
Actually, I need a huge amount of memory (around 500M int variables) and I will apply many kernels to it in parallel. In terms of how kernels access the memory, is using raw pointers faster than thrust::device_vector, and if so, when?
The data in a thrust::device_vector is ordinary global memory; there is no difference in access speed.
Note, however, that the two alternatives you present are not equivalent. cudaMalloc returns uninitialized memory, whereas memory in a thrust::device_vector will be initialized: after allocation, it launches a kernel to initialize its elements, followed by a cudaDeviceSynchronize. This could slow down the code, so you need to benchmark it.
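To illustrate the point, here is a minimal, untested sketch showing that the memory backing a thrust::device_vector can be handed to a kernel as a plain pointer via thrust::raw_pointer_cast, so in-kernel access is identical to the cudaMalloc case (the increment kernel is just a placeholder):
#include <thrust/device_vector.h>

__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;   // same global-memory access as with a raw cudaMalloc pointer
}

int main()
{
    const int N = 1 << 20;
    thrust::device_vector<int> D(N);                    // allocation plus zero-initialization
    int *raw_ptr = thrust::raw_pointer_cast(D.data());  // extract the underlying device pointer
    increment<<<(N + 255) / 256, 256>>>(raw_ptr, N);
    cudaDeviceSynchronize();
    return 0;
}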

C++ array of Eigen dynamically sized matrix

In my application, I have a one-dimensional grid and for each grid point there is a matrix (equally sized and quadratic). For each matrix, a certain update procedure must be performed. At the moment, I define a type
typedef Eigen::Matrix<double, N, N> my_matrix_t;
and allocate the matrices for all grid points using
my_matrix_t *matrices = new my_matrix_t[num_gridpoints];
Now I would like to address matrices whose sizes are only known at run time (but still quadratic), i.e.,
typedef Eigen::Matrix<double, Dynamic, Dynamic> my_matrix_t;
The allocation procedure remains the same and the code seems to work. However, I assume that the array "matrices" contains only the pointers to each individual matrix's storage, and that the overall performance will degrade because the memory has to be gathered from scattered locations before the operation on each matrix can be carried out.
Q0: Contiguous Memory?
Is the assumption correct that
new[] will only store the pointers and the matrix data is stored somewhere on the heap?
it is beneficial to have a contiguous memory region for such problems?
Q1: new[] or std::vector?
Using a std::vector was suggested in the comments. Does this make any difference? Advantages/drawbacks of both solutions?
Q2: Overloading new[]?
I think by overloading the operator new[] in the Eigen::Matrix class (or one of its bases) such an allocation could be achieved. Is this a good idea?
Q3: Alternative ways?
As an alternative, I could think of using a large Eigen::Matrix. Can anyone share their experience here? Do you have other suggestions for me?
Let us sum up what we have so far based on the comments to the question and the mailing list post here. I would like to encourage everyone to edit and add things.
Q0: Contiguous memory region.
Yes, only the matrix objects (each holding a pointer to its own heap-allocated data) are stored contiguously; the coefficient data itself is scattered (independent of using new[] or std::vector).
Generally, in HPC applications, contiguous memory accesses are beneficial.
Q1: The basic mechanisms are the same.
However, std::vector offers more comfort and takes work off the developer. The latter also reduces mistakes and memory leaks.
Q2: Use std::vector.
Overloading new[] is not recommended as it is difficult to get it right. For example, alignment issues could lead to errors on different machines. In order to guarantee correct behavior on all machines, use
std::vector<my_matrix_t, Eigen::aligned_allocator<my_matrix_t>> storage;
as explained here.
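For example, assuming my_matrix_t is the dynamically sized typedef from above and N and num_gridpoints are known at run time, the whole grid could be set up in one line (a hedged usage sketch, not part of the original answer):
std::vector<my_matrix_t, Eigen::aligned_allocator<my_matrix_t>> storage(num_gridpoints, my_matrix_t(N, N));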
Q3: Use a large Eigen Matrix for the complete grid.
Alternatively, let the Eigen library do the complete allocation directly by using one of its own data structures. This guarantees that issues such as alignment and a contiguous memory region are addressed properly. The matrix
Eigen::Matrix<double, Dynamic, Dynamic> storage(N, num_gridpoints * N);
contains all matrices for the complete grid and can be addressed using
/* i + 1 'th matrix for i in [0, num_gridpoints - 1] */
auto matrix = storage.block(0, i * N, N, N);
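A short, hedged sketch (the doubling is just a placeholder for the real per-matrix update) of how the per-grid-point update could be written against this single storage matrix:
#include <Eigen/Dense>

void update_all(Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic> &storage,
                int N, int num_gridpoints)
{
    for (int i = 0; i < num_gridpoints; ++i) {
        // block() returns a view into the contiguous storage; no copy is made
        auto matrix = storage.block(0, i * N, N, N);
        matrix *= 2.0;   // placeholder for the actual update procedure
    }
}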

How to change sub-matrix of a sparse matrix on CUDA device

I have a sparse matrix structure that I am using in conjunction with CUBLAS to implement a linear solver class. I anticipate that the dimensions of the sparse matrices I will be solving will be fairly large (on the order of 10^7 by 10^7).
I also anticipate that the solver will need to be used many times, and that a portion of this matrix will need to be updated several times (between computing solutions) as well.
Copying the entire matrix structure from system memory to device memory could become quite a performance bottleneck, since only a fraction of the matrix entries will ever need to be changed at a given time.
What I would like to be able to do is to have a way to update only a particular sub-set / sub-matrix rather than recopy the entire matrix structure from system memory to device memory each time I need to change the matrix.
The matrix data structure would reside on the CUDA device in arrays:
d_col, d_row, and d_val
On the system side I would have corresponding arrays I, J, and val.
So ideally, I would only want to change the subsets of d_val that correspond to the values in the system array, val, that changed.
Note that I do not anticipate that any entries will be added to or removed from the matrix, only that existing entries will change in value.
Naively, I would think that to implement this, I would have an integer array or vector on the host side, e.g. updateInds, that would track the indices of entries in val that have changed, but I'm not sure how to efficiently tell the CUDA device to update the corresponding values of d_val.
In essence: how do I change the entries in a CUDA device-side array (d_val) at indices updateInds[1], updateInds[2], ..., updateInds[n] to a new set of values val[updateInds[1]], val[updateInds[2]], ..., val[updateInds[n]], without recopying the entire val array from system memory into the CUDA device memory array d_val?
As long as you only want to change the numerical values of the value array associated with a CSR (or CSC, or COO) sparse matrix representation, the process is not complicated.
Suppose I have code like this (excerpted from the CUDA conjugate gradient sample):
checkCudaErrors(cudaMalloc((void **)&d_val, nz*sizeof(float)));
...
cudaMemcpy(d_val, val, nz*sizeof(float), cudaMemcpyHostToDevice);
Now, subsequent to this point in the code, let's suppose I need to change some values in the d_val array, corresponding to changes I have made in val:
for (int i = 10; i < 25; i++)
val[i] = 4.0f;
The process to move these particular changes is conceptually the same as updating a host array with memcpy, but we will use cudaMemcpy to update the d_val array on the device:
cudaMemcpy(d_val+10, val+10, 15*sizeof(float), cudaMemcpyHostToDevice);
Since these values were all contiguous, I can use a single cudaMemcpy call to effect the transfer.
If I have several disjoint regions similar to the above, it will require several calls to cudaMemcpy, one for each region. If, by chance, the regions are equally spaced and of equal length:
for (int i = 10; i < 15; i++)
val[i] = 1.0f;
for (int i = 20; i < 25; i++)
val[i] = 2.0f;
for (int i = 30; i < 35; i++)
val[i] = 4.0f;
then it would also be possible to perform this transfer using a single call to cudaMemcpy2D. The method is outlined here.
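Assuming the regions are exactly as in the loops above (5 changed elements every 10 elements, starting at index 10), the single strided transfer can be treated as 3 "rows" of 5 floats with a "pitch" of 10 floats on both the host and device sides; a hedged sketch:
cudaMemcpy2D(d_val + 10, 10 * sizeof(float),   // destination, destination pitch in bytes
             val + 10,   10 * sizeof(float),   // source, source pitch in bytes
             5 * sizeof(float),                // width of each region in bytes
             3,                                // number of regions (rows)
             cudaMemcpyHostToDevice);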
Notes:
cudaMemcpy2D is slower than you might expect compared to a cudaMemcpy operation on the same number of elements.
CUDA API calls have some inherent overhead. If a large part of the matrix is to be updated in a scattered fashion, it may actually still be quicker to just transfer the whole d_val array, taking advantage of the fact that this can be done with a single cudaMemcpy operation.
The method described here cannot be used if non-zero values change their location in the sparse matrix. In that case, I cannot provide a general answer for how to surgically update a CSR sparse matrix on the device. And certain relatively simple changes could necessitate updating most of the array data (3 vectors) anyway.

Monte Carlo sweep in CUDA

I have a Monte Carlo step in CUDA that I need help with. I already wrote the serial code, and it works as expected. Let's say I have 256 particles, which are stored in
vector< vector<double> > *r;
Each i in r has an (x, y) component, both of which are doubles. Here, r is the position of a particle.
Now, in CUDA, I'm supposed to assign this vector on the host and send it to the device. Once on the device, these particles need to interact with each other. Each thread is supposed to run a Monte Carlo sweep. How do I allocate memory, reference/dereference pointers using cudaMalloc, decide which functions to make global/shared, ...? I just can't wrap my head around it.
Here's what my memory allocation looks like at the moment:
cudaMalloc((void**)&r, (blocks*threads)*sizeof(double));
CUDAErrorCheck();
kernel <<<blocks, threads>>> (&r, randomnums);
cudaDeviceSynchronize();
CUDAErrorCheck();
cudaMemcpy(r, blocks*threads*sizeof(double), cudaMemcpyDeviceToHost);
The above code is at potato level. I guess I'm not sure what to do, even conceptually. My main problem is allocating memory and passing information to and from device and host. The vector r needs to be allocated, copied from host to device, operated on in the device, and copied back to the host. Any help/"pointers" will be much appreciated.
Your "potato level" code demonstrates a general lack of understanding of CUDA, including but not limited to the management of the r data. I would suggest that you increase your knowledge of CUDA by taking advantage of some of the educational resources available, and then develop an understanding of at least one basic CUDA code, such as the vector add sample. You will then be much better able to frame questions and understand the responses you receive. An example:
This would almost never make sense:
cudaMalloc((void**)&r, (blocks*threads)*sizeof(double));
CUDAErrorCheck();
kernel <<<blocks, threads>>> (&r, randomnums);
You either don't know the very basic concept that data must be transferred to the device (via cudaMemcpy) before it can be used by a GPU kernel, or you can't be bothered to write "potato level" code that makes any sense at all, which would suggest to me a lack of effort in writing a sensible question. Also, regardless of what r is, passing &r to a CUDA kernel would never make sense, I don't think.
Regarding your question about how to move r back and forth:
The first step in solving your problem will be to recast the r position data as something that is easily usable by a GPU kernel. In general, vector is not that useful for ordinary CUDA code and vector< vector< > > even less so. And if you have pointers floating about (*r) even less so. Therefore, flatten (copy) your position data into one or two dynamically allocated 1-D arrays of double:
#define N 1000
...
vector< vector<double> > r(N);
...
double *pos_x_h, *pos_y_h, *pos_x_d, *pos_y_d;
pos_x_h=(double *)malloc(N*sizeof(double));
pos_y_h=(double *)malloc(N*sizeof(double));
for (int i = 0; i < N; i++){
    vector<double> temp = r[i];
    pos_x_h[i] = temp[0];
    pos_y_h[i] = temp[1];
}
Now you can allocate space for the data on the device and copy the data to the device:
cudaMalloc(&pos_x_d, N*sizeof(double));
cudaMalloc(&pos_y_d, N*sizeof(double));
cudaMemcpy(pos_x_d, pos_x_h, N*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(pos_y_d, pos_y_h, N*sizeof(double), cudaMemcpyHostToDevice);
Now you can properly pass the position data to your kernel:
kernel<<<blocks, threads>>>(pos_x_d, pos_y_d, ...);
Copying the data back after the kernel will be approximately the
reverse of the above steps. This will get you started:
cudaMemcpy(pos_x_h, pos_x_d, N*sizeof(double), cudaMemcpyDeviceToHost);
cudaMemcpy(pos_y_h, pos_y_d, N*sizeof(double), cudaMemcpyDeviceToHost);
There are many ways to skin the cat, of course, the above is just an example. However the above data organization will be well suited to a kernel/thread strategy that assigns one thread to process one (x,y) position pair.
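A hedged sketch of what such a kernel skeleton might look like with this data layout (the name kernel and the empty body are placeholders; only the one-thread-per-particle indexing is the point here):
__global__ void kernel(double *pos_x, double *pos_y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per (x, y) position pair
    if (i < n) {
        // ... perform the Monte Carlo move for particle i using pos_x[i] and pos_y[i] ...
    }
}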

How and when should I use a pitched pointer with the CUDA API?

I have quite a good understanding of how to allocate and copy linear memory with cudaMalloc() and cudaMemcpy(). However, when I want to use the CUDA functions to allocate and copy 2D or 3D matrices, I am often befuddled by the various arguments, especially the pitched pointers that are always present when dealing with 2D/3D arrays. The documentation provides a couple of examples of how to use them, but it assumes that I am familiar with the notions of padding and pitch, which I am not.
I usually end up tweaking the various examples I find in the documentation or somewhere else on the web, but the blind debugging that follows is quite painful, so my question is:
What is a pitch? How do I use it? How do I allocate and copy 2D and 3D arrays in CUDA?
Here is an explanation of pitched pointers and padding in CUDA.
Linear memory vs padded memory
First, let's start with the reason for the existence of non-linear memory. When allocating memory with cudaMalloc, the result is like an allocation with malloc: we get a contiguous memory chunk of the specified size, and we can put anything we want in it. If we want to allocate a vector of 10000 floats, we simply do:
float* myVector;
cudaMalloc(&myVector, 10000*sizeof(float));
and then access the i-th element of myVector by classic indexing:
float element = myVector[i];
and if we want to access the next element, we just do:
float next_element = myVector[i+1];
This works fine because accessing an element right next to the previous one is (for reasons I am not aware of and don't wish to be for now) cheap.
Things become a little bit different when we use our memory as a 2D array. Let's say our 10000-float vector is in fact a 100x100 array. We can allocate it using the same cudaMalloc function, and if we want to read the i-th row, we do:
float* myArray;
cudaMalloc(&myArray, 10000*sizeof(float));
float row[100]; // number of columns
for (int j=0; j<100; ++j)
row[j] = myArray[i*100+j];
Word alignment
So we have to read memory from myArray+100*i to myArray+100*(i+1)-1. The number of memory access operations this takes depends on the number of memory words the row spans. The number of bytes in a memory word depends on the implementation. To minimize the number of memory accesses when reading a single row, we must ensure that the row starts at the beginning of a word; hence we must pad the memory of every row until the start of a new one.
Bank conflicts
Another reason for padding arrays is the bank mechanism in CUDA, concerning shared memory access. When an array is in shared memory, it is split into several memory banks. Two CUDA threads can access it simultaneously, provided they don't access memory belonging to the same bank. Since we usually want to process each row in parallel, we can ensure that rows can be accessed simultaneously by padding each row so it starts at the beginning of a new bank.
Now, instead of allocating the 2D array with cudaMalloc, we will use cudaMallocPitch:
size_t pitch;
float* myArray;
cudaMallocPitch(&myArray, &pitch, 100*sizeof(float), 100); // width in bytes by height
Note that the pitch here is the return value of the function: cudaMallocPitch checks what it should be on your system and returns the appropriate value. What cudaMallocPitch does is the following:
Allocate the first row.
Check whether the number of bytes allocated leaves it correctly aligned, for example a multiple of 128.
If not, allocate further bytes to reach the next multiple of 128. The pitch is then the number of bytes allocated for a single row, including the extra (padding) bytes.
Reiterate for each row.
At the end, we have typically allocated more memory than necessary, because each row now takes up pitch bytes rather than w*sizeof(float) bytes.
But now, when we want to access an element by its row and column, we must do:
float* row_start = (float*)((char*)myArray + row * pitch);
float column_element = row_start[column];
The byte offset between two successive rows can no longer be deduced from the size of our array; that is why we want to keep the pitch returned by cudaMallocPitch. And since the pitch is a multiple of the padding size (typically the larger of the word size and the bank size), it works great. Yay.
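For completeness, a small hedged sketch (untested; the kernel name and the doubling are placeholders) of the same row/column arithmetic inside a kernel:
__global__ void scale_pitched(float *array, size_t pitch, int width, int height)
{
    int column = blockIdx.x * blockDim.x + threadIdx.x;
    int row    = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && column < width) {
        // step over 'pitch' bytes per row, then index the column as usual
        float *row_start = (float*)((char*)array + row * pitch);
        row_start[column] *= 2.0f;
    }
}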
Copying data to/from pitched memory
Now that we know how to create and access a single element in an array created with cudaMallocPitch, we might want to copy whole parts of it to and from other memory, linear or not.
Let's say we want to copy our array into a 100x100 array allocated on our host with malloc:
float* host_memory = (float*)malloc(100*100*sizeof(float));
If we use cudaMemcpy, we will copy all the memory allocated with cudaMallocPitch, including the padding bytes between rows. What we must do to avoid copying the padding is to copy each row one by one. We can do it manually:
for (size_t i=0; i<100; ++i) {
cudaMemcpy(host_memory + i*100, (char*)myArray + i*pitch,
100*sizeof(float), cudaMemcpyDeviceToHost);
}
Or we can tell the CUDA API that we want only the useful memory out of the memory it allocated with padding bytes for its own convenience, so that if it could deal with its own mess automatically, that would be very nice indeed, thank you. And here enters cudaMemcpy2D:
cudaMemcpy2D(host_memory, 100*sizeof(float)/*no pitch on host*/,
myArray, pitch/*CUDA pitch*/,
100*sizeof(float)/*width in bytes*/, 100/*height*/,
cudaMemcpyDeviceToHost);
Now the copy will be done automatically. It will copy the number of bytes specified by width (here: 100*sizeof(float)), height times (here: 100), skipping pitch bytes every time it jumps to the next row. Note that we must still provide a pitch for the destination memory, because it could be padded too. Here it is not, so the pitch is equal to the pitch of a non-padded array: the size of a row. Note also that the width parameter in the memcpy function is expressed in bytes, while the height parameter is expressed in number of elements. That is because of the way the copy is done, somewhat like the manual copy I wrote above: the width is the size of each copy along a row (elements that are contiguous in memory) and the height is the number of times this operation must be performed. (These inconsistencies in units annoy me very much, as a physicist.)
Dealing with 3D arrays
3D arrays are actually no different from 2D arrays; there is no additional padding involved. A 3D array is just a classical 2D array of padded rows. That is why, when allocating a 3D array, you only get one pitch: the difference in byte count between two successive points along a row. If you want to access successive points along the depth dimension, you can safely multiply the pitch by the number of rows in a slice (the height), which gives you the slicePitch.
The CUDA API for accessing 3D memory is slightly different from the one for 2D memory, but the idea is the same:
When using cudaMalloc3D, you receive a pitch value that you must carefully keep for subsequent access to the memory.
When copying a 3D memory chunk, you cannot use cudaMemcpy unless you are copying a single row. You must use one of the copy utilities provided by the CUDA API that take the pitch into account (such as cudaMemcpy3D).
When you copy your data to/from linear memory, you must still provide a pitch for your pointer even though it is irrelevant: that pitch is simply the size of a row, expressed in bytes.
The size parameters are expressed in bytes for the row size, and in number of elements for the column and depth dimension.
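To tie the 3D case together, here is a hedged, untested sketch using cudaMalloc3D and cudaMemcpy3D to allocate a padded width x height x depth array of floats and fill it from a plain linear host buffer (all sizes are illustrative):
const int width = 64, height = 64, depth = 64;

// extent: row size in bytes, then number of rows and number of slices in elements
cudaExtent extent = make_cudaExtent(width * sizeof(float), height, depth);

cudaPitchedPtr device_array;
cudaMalloc3D(&device_array, extent);   // device_array.pitch holds the padded row size in bytes

float *host_array = (float*)malloc(width * height * depth * sizeof(float));
// ... fill host_array ...

cudaMemcpy3DParms params = {0};
params.srcPtr = make_cudaPitchedPtr(host_array, width * sizeof(float), width, height);
params.dstPtr = device_array;
params.extent = extent;
params.kind   = cudaMemcpyHostToDevice;
cudaMemcpy3D(&params);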