I am new to CUDA.
I have two arrays:
int* AA = new int[5]{1,2,3,4,5};
int* BB = new int[5]{ 2,2,2,4,4 };
and I want to find, for every element of BB, the index of the matching element in AA, which in this case is
{1,1,1,3,3}
Here is my code:
__global__ void findIndex(int* A, int* B, int* C)
{
    int i = threadIdx.x;
    for (int j = 0; j < 5; j++)
    {
        if (B[i] == A[j])
        {
            C[i] = j;
        }
    }
}
int main() {
    int* AA = new int[5]{1,2,3,4,5};
    int* BB = new int[5]{ 2,2,2,4,4 };
    int* CC = new int[5]{ 0,0,0,0,0 };
    int(*ppA), (*ppB), (*ppC);
    cudaMalloc((void**)&ppA, 5 * sizeof(int));
    cudaMalloc((void**)&ppB, 5 * sizeof(int));
    cudaMalloc((void**)&ppC, 5 * sizeof(int));
    cudaMemcpy(ppA, AA, 5 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(ppB, BB, 5 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(ppC, CC, 5 * sizeof(int), cudaMemcpyHostToDevice);
    int numBlocks = 1;
    dim3 threadsPerBlock(5);
    findIndex<<<numBlocks, threadsPerBlock>>>(ppA, ppB, ppC);
    cudaMemcpy(CC, ppC, 5 * sizeof(int), cudaMemcpyDeviceToHost);
    for (int m = 0; m < 5; m++) {
        printf("%d ", CC[m]);
    }
}
My output is:
{1,2,3,0,0}
Can anyone help?
The simplest single-GPU solution (which does not preserve ordering) would be to use atomics, something like this:
__global__ void find(int * arr,int * counter, int * result)
{
    int id = blockIdx.x*blockDim.x+threadIdx.x;
    if(arr[id] == 4)
    {
        int ctr = atomicAdd(counter,1);
        result[ctr] = id;
    }
}
This way you collect the matching indices in the "result" array, and if the wanted number is sparse (only a few occurrences in the whole source array) it doesn't slow things down much. This is not an optimal approach for multi-GPU systems, though: it requires host-side coordination between GPUs, unless the system-scope atomics from the newest CUDA toolkits are used.
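For completeness, here is a minimal host-side sketch of launching that kernel; the array size, block size, and contents are placeholders, error checking is omitted, and N is assumed to be a multiple of the block size (the kernel has no bounds check):
const int N = 1024;                               // placeholder size
int *d_arr, *d_counter, *d_result;
cudaMalloc((void**)&d_arr, N * sizeof(int));
cudaMalloc((void**)&d_result, N * sizeof(int));   // worst case: every element matches
cudaMalloc((void**)&d_counter, sizeof(int));
// ... copy the source data into d_arr with cudaMemcpy ...
cudaMemset(d_counter, 0, sizeof(int));
find<<<N / 256, 256>>>(d_arr, d_counter, d_result);
int found = 0;
cudaMemcpy(&found, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
// The first 'found' entries of d_result now hold the matching indices, in arbitrary order.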
If the number of 4s makes arr "dense", or if you have multiple GPUs, then you should look at other solutions such as stream compaction: first build a mask that marks the cells containing 4, then compact. Some Nvidia blogs and tutorials cover this algorithm.
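As a rough illustration (not from the original answer), a stream-compaction version can be sketched with Thrust, which ships with the CUDA toolkit; the input data here is a made-up placeholder:
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <vector>

// Predicate applied to the stencil values: keep indices whose element equals 4.
struct is_four
{
    __host__ __device__ bool operator()(int v) const { return v == 4; }
};

int main()
{
    std::vector<int> h = { 1, 4, 2, 4, 4, 3 };            // placeholder data
    thrust::device_vector<int> arr(h.begin(), h.end());
    thrust::device_vector<int> indices(arr.size());

    // Stream compaction: copy index i whenever arr[i] == 4, preserving order.
    auto end = thrust::copy_if(thrust::counting_iterator<int>(0),
                               thrust::counting_iterator<int>(int(arr.size())),
                               arr.begin(),                // stencil
                               indices.begin(),
                               is_four());
    indices.resize(end - indices.begin());                 // indices now holds {1, 3, 4}
    return 0;
}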
For the "atomic" solution (especially with shared-memory atomics), the Maxwell architecture and onwards is much better than Kepler, just in case you are still using a Kepler. Also, using atomics is not exactly reproducible, since the order of atomic operations cannot be known; you will get a differently ordered result array most of the time. Stream compaction, on the other hand, preserves the result order, which may save you from writing a sorting algorithm (bitonic sort, shear sort, etc.) on top of it.
So I need to create a matrix with different row lengths, and this is how it looks in normal C/C++:
int** MpesosT = (int**)malloc(N * sizeof(int*));
for (int i = 0; i < N; i++)
{
MpesosT[i] = (int*)malloc(vecinosT[i] * sizeof(int));
}
However, I don't know how to do this using the CUDA function to allocate memory:
int* Vector; cudaMallocManaged(&Vector, VectorSize* sizeof(int));
I can't just use a vector of size N*N or something, because every row has a different size, so how could I do that?
It took a couple of hours, but I found a way to do it. In case anyone has the same problem:
double** Matrix;
cudaMallocManaged((void**)&Matrix, N * sizeof(double*));
for (int i = 0; i < N; i++)
{
    cudaMallocManaged((void**)&Matrix[i], rowlength[i] * sizeof(double));
}
This way, every row can have a different length.
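A possible usage sketch (not part of the original answer): since everything is allocated with cudaMallocManaged, both the host and a kernel can dereference Matrix[i][j] directly, provided the row lengths are also visible to the device and the host synchronizes before reading. The kernel name and launch shape below are made up for illustration:
__global__ void fillRows(double** Matrix, int* rowlength, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        for (int j = 0; j < rowlength[i]; j++)
            Matrix[i][j] = i + 0.5 * j;   // jagged row i has rowlength[i] entries
    }
}

// Host side (assuming rowlength was also allocated with cudaMallocManaged):
// fillRows<<<(N + 255) / 256, 256>>>(Matrix, rowlength, N);
// cudaDeviceSynchronize();   // required before the host touches managed memory again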
I am attempting to load a .mat file containing a tensor of known dimensions (144x192x256) in C++.
I have adjusted the linear index for the read operation to be column major, as in MATLAB. However, I am still getting memory access issues.
void FeatureLoader::readMat(const std::string &fname, Image< std::vector<float> > *out) {
    //Read MAT file.
    const char mode = 'r';
    MATFile *matFile = matOpen(fname.c_str(), &mode);
    if (matFile == NULL) {
        throw std::runtime_error("Cannot read MAT file.");
    }
    //Copy the data from column major to row major storage.
    float *newData = newImage->GetData();
    const mxArray *arr = matGetVariable(matFile, "map");
    if (arr == NULL) {
        throw std::runtime_error("Cannot read variable.");
    }
    double *arrData = (double*)mxGetPr(arr);
    #pragma omp parallel for
    for (int i = 0; i < 144; i++) {
        #pragma omp parallel for
        for (int j = 0; j < 192; j++) {
            for (int k = 0; k < 256; k++) {
                int rowMajIdx = (i * 192 + j) * 256 + k;
                int colMajIdx = (j * 144 + i) * 256 + k;
                newData[rowMajIdx] = static_cast<float>(arrData[colMajIdx]);
            }
        }
    }
}
In the above snippet, am I right to be accessing the data linearly, as with a flattened 3D array in C++? For example:
idx_row_major = (x*WIDTH + y)*DEPTH + z
idx_col_major = (y*HEIGHT + x)*DEPTH + z
Is this the underlying representation that MATLAB uses?
You have some errors in the indexing of the row-major and column-major idx. Additionally, naively accessing the data can lead to very slow times due to random memory access (memory latency is key!).
The best way to pass from MATLAB to C++ types (from 3D to 1D) is to follow the example below.
In this example we illustrate how to take a double real-type 3D matrix from MATLAB, and pass it to a C double* array.
The main objectives of this example are showing how to obtain data from MATLAB MEX arrays and to highlight some small details in matrix storage and handling.
matrixIn.cpp
#include "mex.h"
void mexFunction(int nlhs , mxArray *plhs[],
                 int nrhs, mxArray const *prhs[]){
    // check amount of inputs
    if (nrhs!=1) {
        mexErrMsgIdAndTxt("matrixIn:InvalidInput", "Invalid number of inputs to MEX file.");
    }
    // check type of input
    if( !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0])){
        mexErrMsgIdAndTxt("matrixIn:InvalidType", "Input matrix must be a double, non-complex array.");
    }
    // extract the data
    double const * const matrixAux= static_cast<double const *>(mxGetData(prhs[0]));
    // Get matrix size
    const mwSize *sizeInputMatrix= mxGetDimensions(prhs[0]);
    // allocate array in C. Note: it's a 1D array, not 3D, even if our input is 3D
    double* matrixInC= (double*)malloc(sizeInputMatrix[0] *sizeInputMatrix[1] *sizeInputMatrix[2]* sizeof(double));
    // MATLAB is column major, not row major (as C). We need to reorder the numbers
    // Basically permutes dimensions
    // NOTE: the ordering of the loops is optimized for fastest memory access!
    // This improves the speed by about 300%
    const int size0 = sizeInputMatrix[0]; // Const makes compiler optimization kick in
    const int size1 = sizeInputMatrix[1];
    const int size2 = sizeInputMatrix[2];
    for (int j = 0; j < size2; j++)
    {
        int jOffset = j*size0*size1; // this saves re-computation time
        for (int k = 0; k < size0; k++)
        {
            int kOffset = k*size1; // this saves re-computation time
            for (int i = 0; i < size1; i++)
            {
                int iOffset = i*size0;
                matrixInC[i + jOffset + kOffset] = matrixAux[iOffset + jOffset + k];
            }
        }
    }
    // we are done!
    // Use your C matrix here
    // free memory
    free(matrixInC);
    return;
}
The relevant concepts to be aware of:
MATLAB matrices are all 1D in memory, no matter how many dimensions they have when used in MATLAB. This is also true for most (if not all) main matrix representations in C/C++ libraries, as it allows optimization and faster execution.
You need to explicitly copy matrices from MATLAB to C in a loop.
MATLAB matrices are stored in column-major order, as in Fortran, but C/C++ and most modern languages are row-major. It is important to permute the input matrix, or else the data will look completely different.
The relevant functions in this example are:
mxIsDouble checks if the input is of type double.
mxIsComplex checks whether the input is complex (has an imaginary part).
mxGetData returns a pointer to the real data in the input array, or NULL if there is no real data.
mxGetDimensions returns a pointer to a mwSize array with the size of each dimension.
Suppose I have an arbitrary-size array of integer values that specifies the number of elements for each dimension (level) of the array to be allocated; how do I allocate the array without resorting to recursion? It's preferable to do it without recursion to avoid stack overflow.
So, for example, how to complete a function like this:
template <typename Type>
void* allocMulti (int numDim, int* numElementsPerDim)
{
    // 'Type' if one-dimensional, should be 'void*' otherwise
    void* multiArray = new Type[numElementsPerDim[0]];
    // ...
    return multiArray;
}
I'm looking for a general algorithm that would cover languages without direct memory access.
If the array is actually a matrix (e.g. length AxB and not a list of arrays of different lengths), then you could allocate a single array of length A*B instead of an array of length A where each position is a pointer to an array of length B.
This could also improve performance, as the memory is contiguous (less paging).
You would have to access each cell using a[y * B + x] instead of a[y][x], though (assuming dim(a,0) = A and dim(a,1) = B).
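A minimal sketch of that layout, with made-up sizes:
// One contiguous allocation standing in for an A x B matrix.
const int A = 3, B = 4;
int* a = new int[A * B];

for (int y = 0; y < A; y++)
    for (int x = 0; x < B; x++)
        a[y * B + x] = y * 10 + x;   // what a[y][x] would have been

delete[] a;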
My C++ may be a bit rusty; however, I believe this sort of approach may work:
template <typename T>
T* AllocateMatrix(int dims, int dimLengths[])
{
    // Assert dims >= 1
    int length = dimLengths[0];
    for (int d = 1; d < dims; d++)
        length *= dimLengths[d];
    return new T[length];
}

template <typename T>
T* AccessMatrix(T* matrix, int dims, int dimLengths[], int pos[])
{
    // Assert dims >= 1
    int p = pos[0];
    for (int d = 1; d < dims; d++)
    {
        p = p * dimLengths[d] + pos[d];
    }
    return &matrix[p];
}
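A hypothetical usage of those two helpers, for a 3 x 4 x 5 array of ints, could look like this:
int lengths[] = { 3, 4, 5 };
int* m = AllocateMatrix<int>(3, lengths);

int pos[] = { 2, 1, 4 };
*AccessMatrix(m, 3, lengths, pos) = 42;   // same element as m[(2 * 4 + 1) * 5 + 4]

delete[] m;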
Here's an approach: Allocate the data values as a block, then all the rows of (int *) as a block, then the rows of (int **) as a block, etc.
a) Allocate all the data values as a block. If you have nDim dimensions in the array elementsPerDim, there are prod = product(elementsPerDim, nDim) data values (which you can easily calculate), so you need to allocate:
int prod = product(elementsPerDim, nDim);
int * intblock = calloc(prod, sizeof(int));
b) Allocate all the (int*). Their number is equal to the product of all the dimensions except the last one, so you can simply call your product() function with length nDim-1. So there are product(elementsPerDim, nDim-1) such values, each of size sizeof (int*). Let's allocate them:
int npointers = product(elementsPerDim, nDim-1);
int ** ptrblock = calloc(npointers, sizeof (int *));
Now you must initialize them to point into your block from the previous step. Each pointer gets a non-overlapping block of elementsPerDim[nDim-1] ints, like this:
int rowlength = elementsPerDim[nDim-1];
for (int i=0; i < npointers; i++)
    ptrblock[i] = & intblock[i * rowlength]; /* a.k.a. intblock + i*rowlength */
c) Iterate step b backwards until you run out of dimensions. I.e., follow up step (b) with this loop:
void ** prev_block = (void **) ptrblock;
void ** curblock;
for (int d = nDim-2; d > 0; d--) {
    int npointers = product(elementsPerDim, d);
    curblock = calloc(npointers, sizeof (void *));
    int rowlength = elementsPerDim[d];
    for (int i=0; i < npointers; i++)
        curblock[i] = & prev_block[i * rowlength];
    prev_block = curblock; /* get ready for the next round */
}
When you're done, curblock will be the top-level array of pointers, pointing into the block of second-level pointers, and so on down to the block of ints. You can use normal array notation to dereference them, after casting the top-level block to the appropriate pointer type, e.g. for three dimensions:
((int***)curblock)[3][2][15], etc.
I may have gotten an index off by one somewhere, but this should be the algorithm. You'll notice this is in C, and uses void ** instead of stacking the number of dereferences. You did say you were interested in the algorithm, not in type golf... (It should work as long as all pointers have the same size on your machine.)
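The product() helper used throughout is not defined in the answer; an assumed minimal version would be:
/* Assumed helper: product of the first n entries of dims. */
int product(const int *dims, int n)
{
    int p = 1;
    for (int i = 0; i < n; i++)
        p *= dims[i];
    return p;
}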
I have a rather unexpected issue with one of my functions. Let me explain.
I'm writing a calibration algorithm and since I want to do some grid search (non-continuous optimization), I'm creating my own mesh - different combinations of probabilities.
The size of the grid and the grid itself are computed recursively (I know...).
So in order:
Get variables
Compute corresponding size recursively
Allocate memory for the grid
Pass the empty grid by reference and fill it recursively
The problem I have is after step 4, once I try to retrieve this grid. During step 4, I 'print' the results on the console to check them and everything is fine. I computed several grids with several variables and they all match the results I'm expecting. However, as soon as the grid is taken out of the recursive function, the last column is filled with 0 (all the values from before are replaced in this column only).
I tried allocating one extra column for the grid in step 3, but this only made the problem worse (-3e303 etc. values). Also, I have the error no matter what size I compute it with (very small to very large), so I assume it isn't a memory error (or at least a 'lack of memory' error). Finally, the two functions used and their call are listed below. This was programmed quickly, so some variables might seem kind of useless, I know. However, I'm always open to your comments (plus I'm no expert in C++, hence this thread).
void size_Grid_Computation(int nVars, int endPoint, int consideredVariable, int * indexes, int &sum, int nChoices)
{
/** Remember to initialize r at 1 !! - we exclude var_0 and var_(m-1) (first and last variables) in this algorithm **/
int endPoint2 = 0;
if (consideredVariable < nVars - 2)
{
for (indexes[consideredVariable] = 0; indexes[consideredVariable] < endPoint; indexes[consideredVariable] ++)
{
endPoint2 = endPoint - indexes[consideredVariable];
size_Grid_Computation(nVars, endPoint2, consideredVariable + 1, indexes, sum, nChoices);
}
}
else
{
for (int i = 0; i < nVars - 2; i++)
{
sum -= indexes[i];
}
sum += nChoices;
return;
}
}
The above function is for the grid size. Below for the grid itself -
void grid_Creation(double* choicesVector, double** varVector, int consideredVariable, int * indexes, int endPoint, int nVars, int &r)
{
if (consideredVariable > nVars-1)
return;
for (indexes[consideredVariable] = 0; indexes[consideredVariable] < endPoint; indexes[consideredVariable]++)
{
if (consideredVariable == nVars - 1)
{
double sum = 0.0;
for (int j = 0; j <= consideredVariable; j++)
{
varVector[r][j] = choicesVector[indexes[j]];
sum += varVector[r][j];
printf("%lf\t", varVector[r][j]);
}
varVector[r][nVars - 1] = 1 - sum;
printf("%lf row %d\n", varVector[r][nVars - 1],r+1);
r += 1;
}
grid_Creation(choicesVector, varVector, consideredVariable + 1, indexes, endPoint - indexes[consideredVariable], nVars, r);
}
}
Finally the call
#include <stdio.h>
#include <stdlib.h>
int main()
{
int nVars = 5;
int gridPrecision = 3;
int sum1 = 0;
int r = 0;
int size = 0;
int * index, * indexes;
index = (int *) calloc(nVars - 1, sizeof(int));
indexes = (int *) calloc(nVars, sizeof(int));
for (index[0] = 0; index[0] < gridPrecision + 1; index[0] ++)
{
size_Grid_Computation(nVars, gridPrecision + 1 - index[0], 1, index, size, gridPrecision + 1);
}
double * Y;
Y = (double *) calloc(gridPrecision + 1, sizeof(double));
for (int i = 0; i <= gridPrecision; i++)
{
Y[i] = (double) i/ (double) gridPrecision;
}
double ** varVector;
varVector = (double **) calloc(size, sizeof(double *));
for (int i = 0; i < size; i++)
{
varVector[i] = (double *) calloc(nVars, sizeof(double *));
}
grid_Creation(Y, varVector, 0, indexes, gridPrecision + 1, nVars - 1, r);
for (int i = 0; i < size; i++)
{
printf("%lf\n", varVector[i][nVars - 1]);
}
}
I left my barbarian 'printf's in; they help narrow down the problem. Most likely I have forgotten or butchered one memory allocation, but I can't see which one. Anyway, thanks for the help!
It seems to me that you have a fundamental design problem, namely your 2D array. What you are programming here is not a 2D array but an emulation of one. That only makes sense if you want a sort of sparse data structure where you may leave out parts; in your case it looks as if it is just a plain old matrix that you need.
Nowadays it is neither appropriate in C nor in C++ to program like this.
In C, since that seems to be what you are after, inside functions you can declare matrices even with dynamic bounds as
double A[n][m];
If you fear that this could smash your "stack", you may allocate it dynamically
double (*B)[m] = malloc(sizeof(double[n][m]));
You pass such beasts to functions by putting the bounds first in the parameter list
void toto(size_t n, size_t m, double X[n][m]) {
...
}
Once you have clean and readable code, you will find your bug much easier.
I'm new to CUDA, and I've been trying to figure out what I'm doing wrong here. CUDA is taking longer than just using the CPU to multiply a matrix. If I'm doing something wrong, please let me know.
Here is my code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <cstdlib>
#include <assert.h>
#include <time.h>
#define size 100 // Matrix size
#define cols size // Matrix width
#define rows size // Matrix height
void checkCUDAError(const char *msg)
{
cudaError_t err = cudaGetLastError();
if( cudaSuccess != err)
{
fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
exit(EXIT_FAILURE);
}
}
__global__ void matrixMul( int *A, int *B, int *C)
{
int bx = blockIdx.x; // Block index
int tx = threadIdx.x; // Thread index
int ts = blockDim.x; // number of threads
// Declaration of the shared memory C element
extern __shared__ int c_element_sum[];
c_element_sum[tx] = A[tx+((bx/ts)*ts)] * B[(bx%ts)+(tx*ts)];
//Block until all threads in the block have written their data to shared mem
__syncthreads();
int sum;
for(int i=0; i<ts; i++){
if(i==0){
sum=c_element_sum[i];
}
else{
sum+=c_element_sum[i];
}
}
C[bx] = sum;
}
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
int main(int argc, char** argv)
{
//create timer.
clock_t t1, t2;
//start timer
t1=clock();
//allocate host memory for matrices
unsigned int size_A = cols * rows;
unsigned int mem_size_A = sizeof(int) * size_A;
int* mA = (int*) malloc(mem_size_A);
unsigned int size_B = cols * rows;
unsigned int mem_size_B = sizeof(int) * size_B;
int* mB = (int*) malloc(mem_size_B);
unsigned int size_C = cols * rows;
unsigned int mem_size_C = sizeof(int) * size_C;
int* mC = (int*) malloc(mem_size_C);
//initialize host memory
for (int i = 0; i < size_A; ++i){
mA[i] = 1;
mB[i] = 1;
mC[i] = 0;
}
// allocate device memory
int* d_mA;
int* d_mB;
int* d_mC;
cudaMalloc((void**) &d_mA, mem_size_A);
cudaMalloc((void**) &d_mB, mem_size_B);
cudaMalloc((void**) &d_mC, mem_size_C);
//copy host memory to device (A and B)
cudaMemcpy(d_mA, mA, mem_size_A, cudaMemcpyHostToDevice);
cudaMemcpy(d_mB, mB, mem_size_B, cudaMemcpyHostToDevice);
cudaMemcpy(d_mC, mC, mem_size_C, cudaMemcpyHostToDevice);
// setup execution parameters
int numThreadsPerBlock = cols;
int numBlocks = (cols * rows);
int sharedMemSize = numThreadsPerBlock * sizeof(int);
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
// execute the kernel
matrixMul <<< dimGrid, dimBlock, sharedMemSize >>>(d_mA, d_mB, d_mC);
//Block until device has completed
cudaThreadSynchronize();
// check if kernel execution generated an error
// Check for any CUDA errors
checkCUDAError("kernel invocation");
//copy result from device to host
cudaMemcpy(mC, d_mC, mem_size_C, cudaMemcpyDeviceToHost);
// Check for any CUDA errors
checkCUDAError("memcpy");
//stop timer
t2 = clock();
//check results
for (int i = 0; i < size_C; ++i){
assert(mC[i] == cols);
}
//clean up memory
free(mA);
free(mB);
free(mC);
cudaFree(d_mA);
cudaFree(d_mB);
cudaFree(d_mC);
printf("WITH CUDA - clocks: %d \n\n", t2-t1);
//////////////////////////////
///////// CPU ONLY //////////
/////////////////////////////
//create timer.
clock_t cpu_t1, cpu_t2;
//start timer
cpu_t1=clock();
//allocate host memory for matrices
unsigned int cpu_size_A = cols * rows;
unsigned int cpu_mem_size_A = sizeof(int) * cpu_size_A;
int* cpu_mA = (int*) malloc(cpu_mem_size_A);
unsigned int cpu_size_B = cols * rows;
unsigned int cpu_mem_size_B = sizeof(int) * cpu_size_B;
int* cpu_mB = (int*) malloc(cpu_mem_size_B);
unsigned int cpu_size_C = cols * rows;
unsigned int cpu_mem_size_C = sizeof(int) * cpu_size_C;
int* cpu_mC = (int*) malloc(cpu_mem_size_C);
//initialize host memory
for (int i = 0; i < cpu_size_A; ++i){
cpu_mA[i] = 1;
cpu_mB[i] = 1;
cpu_mC[i] = 0;
}
int ts = cols;
for(int bx=0; bx<(cols*rows);bx++){
int sum = 0;
for(int tx=0; tx<cols; tx++){
sum += cpu_mA[tx+((bx/ts)*ts)] * cpu_mB[(bx%ts)+(tx*ts)];
}
cpu_mC[bx]=sum;
}
//stop timer
cpu_t2 = clock();
//check results
for (int i = 0; i < cpu_size_C; ++i){
assert(cpu_mC[i] == cols);
}
//clean up memory
free(cpu_mA);
free(cpu_mB);
free(cpu_mC);
printf("CPU ONLY - clocks: %d \n\n", cpu_t2-cpu_t1);
return 0;
}
Based on your program, this is expected. Your timer looks like it clocks the entire execution of the program, which would include copying to the device, computation time, and copying the results back. Given the rather small workload you've provided for the program (100x100 matrices), the overhead of the memory copies far outweighs any computational benefit you get when doing the computation with the kernel. Your kernel itself is also not the most efficient implementation.
I don't think you're doing anything wrong, it's just that you haven't provided a large enough chunk of work for the GPU and you could potentially further optimize your kernel. Note that simply scaling up the size of the chunk may not significantly improve the performance with respect to the CPU, since you would also be scaling up the memory management time. While it is relatively simple to write a first implementation of a program on CUDA, it is significantly more difficult to get good performance out of it. The most effective way to use CUDA is to have a high ratio of compute to memory transactions. For example, having a pipeline of several compute-intensive kernels to operate successively on a chunk of data, only needing host-device copying at the beginning and end.
If this is just a program to help you learn to code for CUDA, this is a great step and getting a deep understanding of how to optimize matrix multiplication kernels will serve you well in many other cases. If you are writing this kernel for use in a production piece of software, I would recommend you use the highly-optimized linear algebra library CUBLAS: http://developer.nvidia.com/cublas (or some other library where the hard work has been done for you already).
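For reference, a minimal cuBLAS sketch (an assumption on my part, using float matrices rather than the int matrices above, since the gemm routines operate on floating-point types; error checking omitted):
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Computes C = A * B for n x n float matrices already resident on the device.
// Assumes d_A, d_B, d_C were allocated with cudaMalloc and filled beforehand.
void gemmExample(const float* d_A, const float* d_B, float* d_C, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f;
    const float beta  = 0.0f;

    // cuBLAS uses column-major storage; with square matrices and no transposes
    // this computes C = alpha * A * B + beta * C.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha, d_A, n,
                d_B, n,
                &beta, d_C, n);

    cublasDestroy(handle);
}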