Efficient way to copy strided data (to and from a CUDA Device)?


Is it possible to efficiently copy data that is strided by a constant (or even non-constant) value to and from a CUDA device?
I want to diagonalize a large symmetric matrix.
With the Jacobi algorithm, each iteration performs a number of operations on two rows and two columns.
Since the matrix is too big to be copied to the device in its entirety, I am looking for a way to copy just those two rows and two columns to the device.
It would be nice to store the data in triangular form, but that brings additional downsides:
- non-constant row length [not much of a problem]
- non-constant stride between the values of a column [the stride increases by 1 with each row]
[edit: Even in triangular form, the whole matrix still does not fit on the GPU.]
I looked at some timings and noticed that copying strided values one by one is very slow (both synchronously and asynchronously).
// edit: removed solution - added an answer
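To make the non-constant stride concrete, here is a minimal host-side sketch. It assumes the symmetric matrix is stored as a row-major packed lower triangle; the names packedIndex and gatherColumn are hypothetical helpers introduced only for illustration. It gathers one column into a contiguous buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (i, j), with j <= i, in a row-major packed lower triangle.
// Row i holds i + 1 entries, so it starts at 1 + 2 + ... + i = i * (i + 1) / 2.
inline std::size_t packedIndex(std::size_t i, std::size_t j)
{
    assert(j <= i);
    return i * (i + 1) / 2 + j;
}

// Gather column `col` of a symmetric n x n matrix into a contiguous buffer.
// Entries above the diagonal are read from their mirrored position (col, row).
std::vector<double> gatherColumn(const std::vector<double>& packed,
                                 std::size_t n, std::size_t col)
{
    assert(packed.size() == n * (n + 1) / 2 && col < n);
    std::vector<double> out(n);
    for (std::size_t row = 0; row < n; ++row) {
        out[row] = (row >= col) ? packed[packedIndex(row, col)]
                                : packed[packedIndex(col, row)];
    }
    return out;
}
```

Within a column, packedIndex(i + 1, j) - packedIndex(i, j) equals i + 1, which is the stride that grows by 1 per row. The gathered buffer is contiguous, so it could be sent with a single plain copy; whether the extra host-side pass pays off depends on the matrix size and has not been measured here.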

Thanks to Robert Crovella for the hint to use cudaMemcpy2D.
I'll append my test code so that everyone can follow along.
If anyone has suggestions for solving the copy problem with row-major triangular matrices, feel free to write another answer.
#include <cstdio>   // printf
#include <cstdlib>  // malloc, free, rand
#include <cstring>  // memcpy
__global__ void setValues (double *arr, double value)
{
arr[blockIdx.x] = value;
}
int main( void )
{
// define consts
static size_t const R = 10, C = 10, RC = R*C;
// create matrices and initialize
double * matrix = (double*) malloc(RC*sizeof(double)),
*final_matrix = (double*) malloc(RC*sizeof(double));
for (size_t i=0; i<RC; ++i) matrix[i] = rand()%R+10;
memcpy(final_matrix, matrix, RC*sizeof(double));
// create vectors on the device
double *dev_col, *dev_row,
*h_row = (double*) malloc(C*sizeof(double)),
*h_col = (double*) malloc(R*sizeof(double));
cudaMalloc((void**)&dev_row, C * sizeof(double));
cudaMalloc((void**)&dev_col, R * sizeof(double));
// choose row / col to copy
size_t selected_row = 7, selected_col = 3;
// since we are in row-major order we can copy the row at once
cudaMemcpy(dev_row, &matrix[selected_row*C],
C * sizeof(double), cudaMemcpyHostToDevice);
// the column needs to be copied using cudaMemcpy2D
// with C*sizeof(double) (the row length in bytes) as the source pitch
cudaMemcpy2D(dev_col, sizeof(double), &matrix[selected_col],
C*sizeof(double), sizeof(double), R, cudaMemcpyHostToDevice);
// copy back to host to check whether we got the right column and row
cudaMemcpy(h_row, dev_row, C * sizeof(double), cudaMemcpyDeviceToHost);
cudaMemcpy(h_col, dev_col, R * sizeof(double), cudaMemcpyDeviceToHost);
// change values to evaluate backcopy
setValues<<<R, 1>>>(dev_col, 88.0); // column should be 88
setValues<<<C, 1>>>(dev_row, 99.0); // row should be 99
// backcopy
cudaMemcpy(&final_matrix[selected_row*C], dev_row,
C * sizeof(double), cudaMemcpyDeviceToHost);
cudaMemcpy2D(&final_matrix[selected_col], C*sizeof(double), dev_col,
sizeof(double), sizeof(double), R, cudaMemcpyDeviceToHost);
cudaDeviceSynchronize();
// output for checking functionality
printf("Initial Matrix:\n");
for (size_t i=0; i<R; ++i)
{
for (size_t j=0; j<C; ++j) printf(" %lf", matrix[i*C+j]);
printf("\n");
}
printf("\nRow %zu values: ", selected_row); // %zu is the size_t format
for (size_t i=0; i<C; ++i) printf(" %lf", h_row[i]);
printf("\nCol %zu values: ", selected_col);
for (size_t i=0; i<R; ++i) printf(" %lf", h_col[i]);
printf("\n\n");
printf("Final Matrix:\n");
for (size_t i=0; i<R; ++i)
{
for (size_t j=0; j<C; ++j) printf(" %lf", final_matrix[i*C+j]);
printf("\n");
}
cudaFree(dev_col);
cudaFree(dev_row);
free(matrix);
free(final_matrix);
free(h_row);
free(h_col);
cudaDeviceReset();
return 0;
}
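The pitch arguments of the cudaMemcpy2D calls above can be hard to read at first. The following host-only sketch mimics the argument order of cudaMemcpy2D (without the copy-kind argument) using plain memcpy, so the meaning of dpitch, spitch, width and height can be checked without a GPU. copy2D and extractColumn are hypothetical names, not part of the CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host-only stand-in for a 2D pitched copy:
// copies `height` rows of `width` bytes each, where consecutive rows are
// `spitch` bytes apart in src and `dpitch` bytes apart in dst.
void copy2D(void* dst, std::size_t dpitch,
            const void* src, std::size_t spitch,
            std::size_t width, std::size_t height)
{
    assert(width <= dpitch && width <= spitch);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t r = 0; r < height; ++r) {
        std::memcpy(d + r * dpitch, s + r * spitch, width);
    }
}

// Extract column `col` of a row-major rows x cols matrix. Each "row" of the
// 2D copy is a single double; the source pitch is one full matrix row.
std::vector<double> extractColumn(const std::vector<double>& m,
                                  std::size_t rows, std::size_t cols,
                                  std::size_t col)
{
    assert(m.size() == rows * cols && col < cols);
    std::vector<double> out(rows);
    copy2D(out.data(), sizeof(double),
           &m[col], cols * sizeof(double),
           sizeof(double), rows);
    return out;
}
```

Calling extractColumn(matrix, R, C, selected_col) on the host matrix corresponds to the host-to-device column copy in the test code; swapping the two pitches gives the copy back into the strided column.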

Related

I am getting zeros as a result of vector addition in cuda and no errors

I am running a CUDA vector addition program and getting zeros as the output of its sum. I have tried debugging but cannot locate the problem. It should be adding the numbers, but it simply prints zeros, and I do not understand why.
I have tried everything I can think of with the code and still do not get the expected output.
using namespace std;
__global__ void vecADDKernal(double *A, double *B, double *C, int n){
int id = blockIdx.x*blockDim.x+threadIdx.x;
if(id<n) C[id] = A[id] + B[id];
}
int main( ){
int n = 1048576;
int size = n*sizeof(double);
double *d_A, *d_B;
double *d_C;
double *h_A, *h_B, *h_C;
h_A = (double*)malloc(size);
h_B = (double*)malloc(size);
h_C = (double*)malloc(size);
cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);
int i;
// Initialize vectors on host
for( i = 0; i < n; i++ ) {
h_A[i] = 2*i;
h_B[i] = 3*i;
}
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
int blockSize = 256;
// Number of thread blocks in grid
int gridSize = ceil(n/blockSize);
vecADDKernal<<<gridSize, blockSize>>>(d_A, d_B, d_C, n);
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
double sum = 0;
for(int a = 0; a<n; a++) {
sum = h_C[a];
cout<<h_C[a]<<endl;
}
cout<<"HI "<< sum <<endl;
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
return 0;
}

I just ran your code as is (only adding the necessary includes) and got non-zero output. Have you verified that your device can run the NVIDIA-provided sample successfully?
The sample does exactly what you are trying to do with vector addition, but with proper error checking and result verification.
A few notes:
- The first line in your for loop assigns (=) a value to sum instead of adding the value to sum (+=), so sum ends up holding only the last value rather than the accumulated total.
- Proper error checking helps, even with trivial examples. The sample provides an example, as does the answer Robert linked.
- Did you try opening the memory debugger to see whether your values are in fact 0 in memory? Printing to the console is another place where things can go wrong.
- You can use std::vector to store your host data. vector.data() gives you the raw array for cudaMemcpy operations, and you get easy access to useful things such as range-based for loops and the std::accumulate and std::fill functions.
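The last two notes can be sketched on the host alone. This is a minimal illustration, assuming hypothetical helper names (makeInput, addOnHost, sumAll, byteSize); addOnHost merely stands in for the kernel so the result can be checked without a device.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Fill a host vector with factor * i, as the question does for h_A and h_B.
std::vector<double> makeInput(std::size_t n, double factor)
{
    std::vector<double> v(n);
    for (std::size_t i = 0; i < n; ++i) v[i] = factor * static_cast<double>(i);
    return v;
}

// Host stand-in for the kernel, useful as a reference result.
std::vector<double> addOnHost(const std::vector<double>& a,
                              const std::vector<double>& b)
{
    assert(a.size() == b.size());
    std::vector<double> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i];
    return c;
}

// Accumulated total (what `sum += h_C[a]` would compute).
double sumAll(const std::vector<double>& v)
{
    return std::accumulate(v.begin(), v.end(), 0.0);
}

// Byte count to pass to a copy routine together with v.data().
std::size_t byteSize(const std::vector<double>& v)
{
    return v.size() * sizeof(double);
}
```

With this layout, h_A.data() and byteSize(h_A) would take the place of the malloc'ed pointer and size in the copy calls, and comparing the device result against addOnHost gives a simple verification step.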

Find the index of an element in an array with cuda c++

I am new to CUDA.
I have two arrays:
int* AA = new int[5]{1,2,3,4,5};
int* BB = new int[5]{ 2,2,2,4,4 };
and I want to find, for each element of BB, the index of the equal element in AA, which in this case is
{1,1,1,3,3}
Here is my code:
__global__ void findIndex(int* A, int* B, int* C)
{
int i = threadIdx.x;
for (int j = 0; j < 5; j++)
{
if (B[i] == A[j])
{
C[i] = j;
}
}
}
int main() {
int* AA = new int[5]{1,2,3,4,5};
int* BB = new int[5]{ 2,2,2,4,4 };
int* CC = new int[5]{ 0,0,0,0,0 };
int(*ppA), (*ppB), (*ppC);
cudaMalloc((void**)&ppA, (5) * sizeof(int));
cudaMalloc((void**)&ppB, (5) * sizeof(int));
cudaMalloc((void**)&ppC, (5) * sizeof(int));
cudaMemcpy(ppA, AA, 5 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(ppB, BB, 5 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(ppC, CC, 5 * sizeof(int), cudaMemcpyHostToDevice);
int numBlocks = 1;
dim3 threadsPerBlock(5);
findIndex << <numBlocks, threadsPerBlock >> > (ppA, ppB, ppC);
cudaMemcpy(CC, ppC, 5 * sizeof(int), cudaMemcpyDeviceToHost);
for (int m = 0; m < 5; m++) {
printf("%d ", CC[m]);
}
}
My output is:
{1,2,3,0,0}
Can anyone help?

The simplest non-stable single-GPU solution would be to use atomics, something like this:
__global__ void find(int * arr,int * counter, int * result)
{
int id = blockIdx.x*blockDim.x+threadIdx.x;
if(arr[id] == 4)
{
int ctr = atomicAdd(counter,1);
result[ctr] = id;
}
}
This way you get the matching indices in the result array, and if the wanted number is sparse (only a few occurrences in the whole source array) the atomics will not slow things down much. It is not an optimal approach for multi-GPU systems, though: it requires host-side coordination between the GPUs unless system-level atomics, a CUDA feature available in newer toolkits, are used.
If the 4s are dense in arr, or if you have multiple GPUs, you should look at other solutions such as stream compaction: first mark the cells containing 4 in a mask, then do the compaction. Some NVIDIA blog posts and tutorials describe this algorithm.
For the atomic solution (especially with shared-memory atomics), Maxwell and later architectures are much better than Kepler, in case you are still using a Kepler card. Also, the output of the atomic version is not reproducible, because the order in which the atomic operations execute is not known; you will get a differently ordered result array most of the time. Stream compaction, by contrast, preserves the order of the results, which may save you from writing a sorting algorithm (bitonic sort, shear sort, etc.) on top of it.
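The mask-then-compact idea can be written down as a sequential C++ reference. This is only a sketch of the three logical steps (mask, exclusive prefix sum, scatter); a GPU version would run each step in parallel, and compactIndices is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Return the indices of all elements equal to `wanted`, in ascending order.
std::vector<int> compactIndices(const std::vector<int>& arr, int wanted)
{
    const std::size_t n = arr.size();
    assert(n <= static_cast<std::size_t>(std::numeric_limits<int>::max()));

    // Step 1: mask, 1 where the element matches, 0 elsewhere.
    std::vector<int> mask(n);
    for (std::size_t i = 0; i < n; ++i) mask[i] = (arr[i] == wanted) ? 1 : 0;

    // Step 2: exclusive prefix sum of the mask gives each match its output slot.
    std::vector<std::size_t> offset(n);
    std::size_t running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        offset[i] = running;
        running += static_cast<std::size_t>(mask[i]);
    }

    // Step 3: scatter the matching indices to their slots.
    std::vector<int> result(running);
    for (std::size_t i = 0; i < n; ++i) {
        if (mask[i]) result[offset[i]] = static_cast<int>(i);
    }
    return result;
}
```

Because every match writes to a slot determined by the prefix sum rather than by an atomic counter, the output order is the same on every run.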

two-dimensional arrays in CUDA

I'm practicing with this simple code, which takes two two-dimensional arrays and adds them with CUDA. In the end, the result in c is not what I am expecting. Also, I was wondering whether I can use std::vector instead of C-style arrays.
#include <iostream>
using namespace std;
#define N 2
__global__ void MatAdd(double** a, double** b,
double** c)
{
int i = threadIdx.x;
int j = threadIdx.y;
c[i][j] = a[i][j] + b[i][j];
}
int main()
{
double a[2][2]= {{1.0,2.0},{3.0,4.0}};
double b[2][2]= {{1.0,2.0},{3.0,4.0}};
double c[2][2]; // it will be the result!
double** a_d;
double** b_d;
double** c_d;
int d_size = N * N * sizeof(double);
int numBlocks = 1;
dim3 threadsPerBlock(N, N);
cudaMalloc(&a_d, d_size);
cudaMalloc(&b_d, d_size);
cudaMalloc(&c_d, d_size);
cudaMemcpy(a_d, a, d_size, cudaMemcpyHostToDevice);
cudaMemcpy(b_d, b, d_size, cudaMemcpyHostToDevice);
cudaMemcpy(c_d, c, d_size, cudaMemcpyHostToDevice);
MatAdd<<<numBlocks, threadsPerBlock>>>(a_d, b_d, c_d);
//cudaDeviceSynchronize();
cudaMemcpy(c, c_d, d_size, cudaMemcpyDeviceToHost);
for (int i=0; i<N; i++){
for(int j=0; j<N; j++){
cout<<c[i][j]<<endl;
}
}
return 0;
}

You must not use the double** type in this case. Instead, you should use a flattened array that holds all the values of a given matrix in a double*-typed variable.
The heart of the problem is the following line (and the similar ones after it):
cudaMemcpy(a_d, a, d_size, cudaMemcpyHostToDevice);
Here you assume that a and a_d are compatible types, but they are not. A double**-typed variable is a pointer that refers to one or more pointers in memory (typically an array of pointers referencing many different double-typed arrays), while a double*-typed variable or a static 2D C array refers to a contiguous location in memory.
Note that you can access a given (i,j) cell of a matrix using matrix[N*i+j], where N is the number of columns, assuming matrix is a flattened matrix of type double* in row-major order.
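As a host-side illustration of the flattened layout, the sketch below adds two matrices stored in std::vector<double> using the matrix[N*i+j] indexing described above. flatIndex and addFlat are hypothetical names; a kernel would compute the same index from its thread coordinates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major flat index of cell (i, j) in a matrix with `cols` columns.
inline std::size_t flatIndex(std::size_t i, std::size_t j, std::size_t cols)
{
    return cols * i + j;
}

// Element-wise sum of two rows x cols matrices stored as flat arrays.
std::vector<double> addFlat(const std::vector<double>& a,
                            const std::vector<double>& b,
                            std::size_t rows, std::size_t cols)
{
    assert(a.size() == rows * cols && b.size() == rows * cols);
    std::vector<double> c(rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            const std::size_t k = flatIndex(i, j, cols);
            c[k] = a[k] + b[k];
        }
    }
    return c;
}
```

Since a std::vector<double> is contiguous, its data() pointer and size() * sizeof(double) can be handed to a copy routine the same way as a double* buffer, which also covers the question about using vectors.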

vector<vector<double>> to mxArray using memcpy

I have the correlation matrix of a data set and I want to use PCA to transform it to an uncorrelated set.
So I decided to use the MATLAB Engine (C++ MEX API) to perform the PCA.
My question is how to copy the matrix contents to an mxArray efficiently.
I used loops to assign each element of the matrix. I also looked into memcpy, but it seems error prone.
I tested the following, and it copies only the first column!
memcpy((double *)mxGetPr(T), &rho_mat[0][0], rows * sizeof(double));
What is the best way to copy the data (matrix -> mxArray and mxArray -> matrix)?
void pca(vector<vector<double>>& rho_mat)
{
Engine *ep;
mxArray *T = NULL, *result = NULL;
if (!(ep = engOpen(""))) {
fprintf(stderr, "\nCan't start MATLAB engine\n");
return;
}
size_t rows = rho_mat.size();
size_t cols = rho_mat[0].size();
T = mxCreateDoubleMatrix(rows, cols, mxREAL);
double * buf = (double *)mxGetPr(T);
for (int i = 0; i<rows; i++) {
for (int j = 0; j<cols; j++) {
buf[i*(cols)+j] = rho_mat[i][j];
}
}
engPutVariable(ep, "T", T);
engEvalString(ep, "PC = pcacov(T);");
result = engGetVariable(ep, "PC");
}
Thanks
Regards

You can try using std::memcpy in a loop for each row.
for (int i = 0; i<rows; i++)
{
std::memcpy(buf + i*cols, &rho_mat[i][0], cols * sizeof(double));
}
Please note that you have to use cols in your memcpy to ensure each row is copied in full. In your example, using rows instead may have worked only by coincidence, if your matrix was square.
You can refer to this answer on how to copy a 1-d vector using memcpy.
Edit:
To copy from the 2-D array to the 2-D vector (assuming the vector already has size rows x cols):
for (int i = 0; i<rows; i++)
{
std::memcpy(&rho_mat[i][0], buf + i*cols, cols * sizeof(double));
}
Please note the assumption made above.
OR
A much cleaner way would be to use std::vector::assign or the range constructor of std::vector:
if(rho_mat.size() == 0)
{
for (int i = 0; i<rows; i++)
{
// the element type must match rho_mat, i.e. double
rho_mat.push_back(vector<double>(buf + i*cols, buf + i*cols + cols));
//OR
//rho_mat.push_back(vector<double>());
//rho_mat[i].assign(buf + i*cols, buf + i*cols + cols);
}
}
}
}
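One caveat worth illustrating: MATLAB stores matrix data in column-major order, so copying whole rows back to back into the mxArray buffer yields the transpose for a non-symmetric matrix (for a symmetric correlation matrix the two layouts coincide). The sketch below shows a layout-aware copy in both directions on a plain buffer; toColumnMajor and fromColumnMajor are hypothetical names, and the buffer stands in for the pointer returned by mxGetPr.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flatten a rows x cols matrix into column-major order:
// element (i, j) goes to buf[j * rows + i].
std::vector<double> toColumnMajor(const std::vector<std::vector<double>>& m)
{
    const std::size_t rows = m.size();
    const std::size_t cols = rows ? m[0].size() : 0;
    std::vector<double> buf(rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        assert(m[i].size() == cols);  // all rows must have the same length
        for (std::size_t j = 0; j < cols; ++j) buf[j * rows + i] = m[i][j];
    }
    return buf;
}

// Rebuild the nested vector from a column-major buffer.
std::vector<std::vector<double>> fromColumnMajor(const double* buf,
                                                 std::size_t rows,
                                                 std::size_t cols)
{
    std::vector<std::vector<double>> m(rows, std::vector<double>(cols));
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j) m[i][j] = buf[j * rows + i];
    return m;
}
```

The element-wise loops give up the speed of memcpy, so for symmetric input the row-wise memcpy above remains the simpler choice.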

MATLAB Tensor Indexing in C++

I am attempting to load in a .mat file containing a tensor of known dimensions in C++; 144x192x256.
I have adjusted the linear index for the read operation to be column major as in MATLAB. However I am still getting memory access issues.
void FeatureLoader::readMat(const std::string &fname, Image< std::vector<float> > *out) {
//Read MAT file.
const char mode = 'r';
MATFile *matFile = matOpen(fname.c_str(), &mode);
if (matFile == NULL) {
throw std::runtime_error("Cannot read MAT file.");
}
//Copy the data from column major to row major storage.
float *newData = newImage->GetData();
const mxArray *arr = matGetVariable(matFile, "map");
if (arr == NULL) {
throw std::runtime_error("Cannot read variable.");
}
double *arrData = (double*)mxGetPr(arr);
#pragma omp parallel for
for (int i = 0; i < 144; i++) {
#pragma omp parallel for
for (int j = 0; j < 192; j++) {
for (int k = 0; k < 256; k++) {
int rowMajIdx = (i * 192 + j) * 256 + k;
int colMajIdx = (j * 144 + i) * 256 + k;
newData[rowMajIdx] = static_cast<float>(arrData[colMajIdx]);
}
}
}
}
In the above snippet, am I right to access the data linearly, as with a flattened 3D array in C++? For example:
idx_row_major = (x*WIDTH + y)*DEPTH + z
idx_col_major = (y*HEIGHT + x)*DEPTH + z
Is this the underlying representation that MATLAB uses?

You have some errors in the computation of the row-major and column-major indices. Additionally, naively accessing the data can be very slow because of non-sequential memory access (memory latency is key!).
The best way to pass data from MATLAB to C++ types (from 3D to 1D) is to follow the example below.
In this example we illustrate how to take a double real-type 3D matrix from MATLAB, and pass it to a C double* array.
The main objectives of this example are showing how to obtain data from MATLAB MEX arrays and to highlight some small details in matrix storage and handling.
matrixIn.cpp
#include "mex.h"
void mexFunction(int nlhs , mxArray *plhs[],
int nrhs, mxArray const *prhs[]){
// check amount of inputs
if (nrhs!=1) {
mexErrMsgIdAndTxt("matrixIn:InvalidInput", "Invalid number of inputs to MEX file.");
}
// check type of input
if( !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0])){
mexErrMsgIdAndTxt("matrixIn:InvalidType", "Input matrix must be a double, non-complex array.");
}
// extract the data
double const * const matrixAux= static_cast<double const *>(mxGetData(prhs[0]));
// Get matrix size
const mwSize *sizeInputMatrix= mxGetDimensions(prhs[0]);
// allocate array in C. Note: it's a 1D array, not 3D, even if our input is 3D
double* matrixInC= (double*)malloc(sizeInputMatrix[0] *sizeInputMatrix[1] *sizeInputMatrix[2]* sizeof(double));
// MATLAB is column major, not row major (as C). We need to reorder the numbers
// Basically permutes dimensions
// NOTE: the ordering of the loops is optimized for fastest memory access!
// This improves the speed by about 300%
const int size0 = sizeInputMatrix[0]; // Const makes compiler optimization kick in
const int size1 = sizeInputMatrix[1];
const int size2 = sizeInputMatrix[2];
for (int j = 0; j < size2; j++)
{
int jOffset = j*size0*size1; // this saves re-computation time
for (int k = 0; k < size0; k++)
{
int kOffset = k*size1; // this saves re-computation time
for (int i = 0; i < size1; i++)
{
int iOffset = i*size0;
matrixInC[i + jOffset + kOffset] = matrixAux[iOffset + jOffset + k];
}
}
}
// we are done!
// Use your C matrix here
// free memory
free(matrixInC);
return;
}
The relevant concepts to be aware of:
- MATLAB matrices are all 1D in memory, no matter how many dimensions they have when used in MATLAB. This is also true of most (if not all) major matrix representations in C/C++ libraries, as it allows optimization and faster execution.
- You need to explicitly copy matrices from MATLAB to C in a loop.
- MATLAB matrices are stored in column-major order, as in Fortran, whereas C/C++ arrays are row-major. It is important to permute the input matrix, or else the data will look completely different.
The relevant functions in this example are:
- mxIsDouble checks whether the input is of double type.
- mxIsComplex checks whether the input is complex.
- mxGetData returns a pointer to the real data in the input array, or NULL if there is no real data.
- mxGetDimensions returns a pointer to a mwSize array holding the size of each dimension.
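For the 3D case in the question, the two index formulas can be written side by side. This is a minimal sketch, assuming a d0 x d1 x d2 tensor whose source buffer is column-major (first index varies fastest) and whose destination should be row-major (last index varies fastest); the function names are hypothetical, and the loop order here favours clarity over memory-access speed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major linear index: the first index i varies fastest.
inline std::size_t colMajorIndex(std::size_t i, std::size_t j, std::size_t k,
                                 std::size_t d0, std::size_t d1)
{
    return i + d0 * (j + d1 * k);
}

// Row-major linear index: the last index k varies fastest.
inline std::size_t rowMajorIndex(std::size_t i, std::size_t j, std::size_t k,
                                 std::size_t d1, std::size_t d2)
{
    return (i * d1 + j) * d2 + k;
}

// Convert a column-major double tensor to a row-major float tensor.
std::vector<float> toRowMajor(const std::vector<double>& src,
                              std::size_t d0, std::size_t d1, std::size_t d2)
{
    assert(src.size() == d0 * d1 * d2);
    std::vector<float> dst(src.size());
    for (std::size_t i = 0; i < d0; ++i)
        for (std::size_t j = 0; j < d1; ++j)
            for (std::size_t k = 0; k < d2; ++k)
                dst[rowMajorIndex(i, j, k, d1, d2)] =
                    static_cast<float>(src[colMajorIndex(i, j, k, d0, d1)]);
    return dst;
}
```

For the 144x192x256 tensor in the question this gives a source index of i + 144 * (j + 192 * k), which differs from the (j * 144 + i) * 256 + k used in the snippet.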
