2d char array to CUDA kernel - c++

I need help with transfer char[][] to Cuda kernel. This is my code:
__global__
void kernel(char** BiExponent){
for(int i=0; i<500; i++)
printf("%c",BiExponent[1][i]); // I want print line 1
}
int main(){
char (*Bi2dChar)[500] = new char [5000][500];
char **dev_Bi2dChar;
...//HERE I INPUT DATA TO Bi2dChar
size_t host_orig_pitch = 500 * sizeof(char);
size_t pitch;
cudaMallocPitch((void**)&dev_Bi2dChar, &pitch, 500 * sizeof(char), 5000);
cudaMemcpy2D(dev_Bi2dChar, pitch, Bi2dChar, host_orig_pitch, 500 * sizeof(char), 5000, cudaMemcpyHostToDevice);
kernel <<< 1, 512 >>> (dev_Bi2dChar);
free(Bi2dChar); cudaFree(dev_Bi2dChar);
}
I use:
nvcc.exe" -gencode=arch=compute_20,code=\"sm_20,compute_20\" --use-local-env --cl-version 2012 -ccbin
Thanks for help.

cudaMemcpy2D doesn't actually handle 2-dimensional (i.e. double pointer, **) arrays in C.
Note that the documentation indicates it expects single pointers, not double pointers.
Generally speaking, moving arbitrary double pointer C arrays between the host and the device is more complicated than a single pointer array.
If you really want to handle the double-pointer array, then search on "CUDA 2D Array" in the upper right hand corner of this page, and you'll find various examples of how to do it. (For example, the answer given by #talonmies here)
Often, an easier approach is simply to "flatten" the array so it can be referenced by a single pointer, i.e. char[] instead of char[][], and then use index arithmetic to simulate 2-dimensional access.
Your flattened code would look something like this:
(the code you provided is an uncompilable, incomplete snippet, so mine is also)
#define XDIM 5000
#define YDIM 500
__global__
void kernel(char* BiExponent){
for(int i=0; i<500; i++)
printf("%c",BiExponent[(1*XDIM)+i]); // I want print line 1
}
int main(){
char (*Bi2dChar)[YDIM] = new char [XDIM][YDIM];
char *dev_Bi2dChar;
...//HERE I INPUT DATA TO Bi2dChar
cudaMalloc((void**)&dev_Bi2dChar,XDIM*YDIM * sizeof(char));
cudaMemcpy(dev_Bi2dChar, &(Bi2dChar[0][0]), host_orig_pitch, XDIM*YDIM * sizeof(char), cudaMemcpyHostToDevice);
kernel <<< 1, 512 >>> (dev_Bi2dChar);
free(Bi2dChar); cudaFree(dev_Bi2dChar);
}
If you want a pitched array, you can create it similarly, but you will still do so as single pointer arrays, not double pointer arrays.

You can't use printf in a Cuda kernel. The reason being is that the code is being executed on the GPU and not on the host CPU.
You can, however use cuPrintf
How do we use cuPrintf()?

Related

Fast copying contiguous array of arrays

I am trying to copy from an array of arrays, to another one, while leaving a space between arrays in the target.
They are both contiguous each vector sizes size is between 5000 and 52000 floats,
Output_jump is the vector size times eight, and vector_count vary in my tests.
I did the best I learned here https://stackoverflow.com/a/34450588/1238848 and here https://stackoverflow.com/a/16658555/1238848
but still it seems so slow.
void copyToTarget(const float *input, float *output, int vector_count, int vector_size, int output_jump)
{
int left_to_do,offset;
constexpr int block=2048;
constexpr int blockInBytes = block*sizeof(float);
float temp[2048];
for (int i = 0; i < vector_count; ++i)
{
left_to_do = vector_size;
offset = 0;
while(left_to_do > block)
{
memcpy(temp, input, blockInBytes);
memcpy(output, temp, blockInBytes);
left_to_do -= block;
input += block;
output += block;
}
if (left_to_do)
{
memcpy(temp, input, left_to_do*sizeof(float));
memcpy(output, temp, left_to_do*sizeof(float));
input += left_to_do;
output += left_to_do;
}
output += output_jump;
}
}
I'm skeptical of the answer you linked, which encourages avoiding a function call to memcpy. Surely the implementation of memcpy is very well optimized, probably hand written in assembly, and therefore hard to beat! Moreover for large-sized copies, the function call overhead is negligible compared to memory access latency. So simply calling memcpy is likely the fastest way to copy contiguous bytes around in memory.
If output_jump were zero, a single call to memcpy can copy input directly to output (and this would be hard to beat). For nonzero output_jump, the copy needs to be divided up over the contiguous vectors. Use one memcpy per vector, without the temp buffer, copying directly from input + i * vector_size to output + i * (vector_size + output_jump).
But better yet, like the top answer on that thread suggests, try if possible to find a way to avoid copying data in the first place.

SIGSEGV in CUDA allocation

I've an host array of uint64_t of size spectrum_size and I need to allocate and copy it on my GPU.
But when I'm trying to allocate this in the GPU memory, but I continue to receive SIGSEGV... Any ideas?
uint64_t * gpu_hashed_spectrum;
uint64_t * gpu_hashed_spectrum_h = new uint64_t [spectrum_size];
HANDLE_ERROR(cudaMalloc((void **)&gpu_hashed_spectrum, sizeof(uint64_t *) * spectrum_size));
for(i=0; i<spectrum_size; i++) {
HANDLE_ERROR(cudaMalloc((void **)&gpu_hashed_spectrum_h[i], sizeof(uint64_t)));
}
printf("\t\t...Copying\n");
for(i=0; i<spectrum_size; i++) {
HANDLE_ERROR(cudaMemcpy((void *)gpu_hashed_spectrum_h[i], (const void *)hashed_spectrum[i], sizeof(uint64_t), cudaMemcpyHostToDevice));
}
HANDLE_ERROR(cudaMemcpy(gpu_hashed_spectrum, gpu_hashed_spectrum_h, spectrum_size * sizeof(uint64_t *), cudaMemcpyHostToDevice));
Full code available here
UPDATE:
I tried to do in this way, noy I've got SIGSEGV on other parts of the code (in the kernel, when using this array. Maybe is due to other errors.
uint64_t * gpu_hashed_spectrum;
HANDLE_ERROR(cudaMalloc((void **)&gpu_hashed_spectrum, sizeof(uint64_t) * spectrum_size));
HANDLE_ERROR(cudaMemcpy(gpu_hashed_spectrum, hashed_spectrum, spectrum_size * sizeof(uint64_t), cudaMemcpyHostToDevice));
At least you are confused about uint64_t** and unit64_t*.
At line 1, you define gpu_hashed_spectrum as a pointer, pointing to some data of the type unit64_t, but at line 3
HANDLE_ERROR(cudaMalloc((void **)&gpu_hashed_spectrum, sizeof(uint64_t *) * spectrum_size));
you use gpu_hashed_spectrum as a pointer, pointing to some data of the type unit64_t*.
Maybe you should change your definition to
uint64_t** gpu_hashed_spectrum;
As well as some other lines.

C array with float data crashing in Objective C class (EXC_BAD_ACCESS)

I am doing some audio processing and therefore mixing some C and Objective C. I have set up a class that handles my OpenAL interface and my audio processing. I have changed the class suffix to
.mm
...as described in the Core Audio book among many examples online.
I have a C style function declared in the .h file and implemented in the .mm file:
static void granularizeWithData(float *inBuffer, unsigned long int total) {
// create grains of audio data from a buffer read in using ExtAudioFileRead() method.
// total value is: 235377
float tmpArr[total];
// now I try to zero pad a new buffer:
for (int j = 1; j <= 100; j++) {
tmpArr[j] = 0;
// CRASH on first iteration EXC_BAD_ACCESS (code=1, address= ...blahblah)
}
}
Strange??? Yes I am totally out of ideas as to why THAT doesn't work but the FOLLOWING works:
float tmpArr[235377];
for (int j = 1; j <= 100; j++) {
tmpArr[j] = 0;
// This works and index 0 - 99 are filled with zeros
}
Does anyone have any clue as to why I can't declare an array of size 'total' which has an int value? My project uses ARC, but I don't see why this would cause a problem. When I print the value of 'total' when debugging, it is in fact the correct value. If anyone has any ideas, please help, it is driving me nuts!
Problem is that that array gets allocated on the stack and not on the heap. Stack size is limited so you can't allocate an array of 235377*sizeof(float) bytes on it, it's too large. Use the heap instead:
float *tmpArray = NULL;
tmpArray = (float *) calloc(total, sizeof(float)); // allocate it
// test that you actually got the memory you asked for
if (tmpArray)
{
// use it
free(tmpArray); // release it
}
Mind that you are always responsible of freeing memory which is allocated on the heap or you will generate a leak.
In your second example, since size is known a priori, the compiler reserves that space somewhere in the static space of the program thus allowing it to work. But in your first example it must do it on the fly, which causes the error. But in any case before being sure that your second example works you should try accessing all the elements of the array and not just the first 100.

writing Bytes into a .Bin file

I have a vector in C++ that I want to write it to a .bin file.
this vector's type is byte, and the number of bytes could be huge, maybe millions.
I am doing it like this:
if (depthQueue.empty())
return;
FILE* pFiledep;
pFiledep = fopen("depth.bin", "wb");
if (pFiledep == NULL)
return;
byte* depthbuff = (byte*) malloc(depthQueue.size() * 320 * 240 * sizeof(byte));
if(depthbuff)
{
for(int m = 0; m < depthQueue.size(); m++)
{
byte b = depthQueue[m];
depthbuff[m] = b;
}
fwrite(depthbuff, sizeof(byte),
depthQueue.size() * 320 * 240 * sizeof(byte), pFiledep);
fclose(pFiledep);
free(depthbuff);
}
depthQueue is my vector which contains bytes and lets say its size is 100,000.
Sometimes I don't get this error, but the bin file is empty.
Sometime I get heap error.
Somtimes when I debug this, it seems that malloc doesn't allocate the space.
Is the problem is with space?
Or is chunk of sequential memory is so long and it can't write in bin?
You don't need hardly any of that. vector contents are guaranteed to be contiguous in memory, so you can just write from it directly:
fwrite(&depthQueue[0], sizeof (Byte), depthQueue.size(), pFiledep);
Note a possible bug in your code: if the vector is indeed vector<Byte>, then you should not be multiplying its size by 320*240.
EDIT: More fixes to the fwrite() call: The 2nd parameter already contains the sizeof (Byte) factor, so don't do that multiplication again in the 3rd parameter either (even though sizeof (Byte) is probably 1 so it doesn't matter).

Why is memset() incorrectly initializing int?

Why is the output of the following program 84215045?
int grid[110];
int main()
{
memset(grid, 5, 100 * sizeof(int));
printf("%d", grid[0]);
return 0;
}
memset sets each byte of the destination buffer to the specified value. On your system, an int is four bytes, each of which is 5 after the call to memset. Thus, grid[0] has the value 0x05050505 (hexadecimal), which is 84215045 in decimal.
Some platforms provide alternative APIs to memset that write wider patterns to the destination buffer; for example, on OS X or iOS, you could use:
int pattern = 5;
memset_pattern4(grid, &pattern, sizeof grid);
to get the behavior that you seem to expect. What platform are you targeting?
In C++, you should just use std::fill_n:
std::fill_n(grid, 100, 5);
memset(grid, 5, 100 * sizeof(int));
You are setting 400 bytes, starting at (char*)grid and ending at (char*)grid + (100 * sizeof(int)), to the value 5 (the casts are necessary here because memset deals in bytes, whereas pointer arithmetic deals in objects.
84215045 in hex is 0x05050505; since int (on your platform/compiler/etc.) is represented by four bytes, when you print it, you get "four fives."
memset is about setting bytes, not values. One of the many ways to set array values in C++ is std::fill_n:
std::fill_n(grid, 100, 5);
Don't use memset.
You set each byte [] of the memory to the value of 5. Each int is 4 bytes long [5][5][5][5], which the compiler correctly interprets as 5*256*256*256 + 5*256*256 + 5*256 + 5 = 84215045. Instead, use a for loop, which also doesn't require sizeof(). In general, sizeof() means you're doing something the hard way.
for(int i=0; i<110; ++i)
grid[i] = 5;
Well, the memset writes bytes, with the selected value. Therefore an int will look something like this:
00000101 00000101 00000101 00000101
Which is then interpreted as 84215045.
You haven't actually said what you want your program to do.
Assuming that you want to set each of the first 100 elements of grid to 5 (and ignoring the 100 vs. 110 discrepancy), just do this:
for (int i = 0; i < 100; i ++) {
grid[i] = 5;
}
I understand that you're concerned about speed, but your concern is probably misplaced. On the one hand, memset() is likely to be optimized and therefore faster than a simple loop. On the other hand, the optimization is likely to consist of writing more than one byte at a time, which is what this loop does. On the other other hand, memset() is a loop anyway; writing the loop explicitly rather than burying it in a function call doesn't change that. On the other other other hand, even if the loop is slow, it's not likely to matter; concentrate on writing clear code, and think about optimizing it if actual measurements indicate that there's a significant performance issue.
You've spent many orders of magnitude more time writing the question than your computer will spend setting grid.
Finally, before I run out of hands (too late!), it doesn't matter how fast memset() is if it doesn't do what you want. (Not setting grid at all is even faster!)
If you type man memset on your shell, it tells you that
void * memset(void *b, int c, size_t len)
A plain English explanation of this would be, it fills a byte string b of length len with each byte a value c.
For your case,
memset(grid, 5, 100 * sizeof(int));
Since sizeof(int)==4, thus the above code pieces looked like:
for (int i=0; i<100; i++)
grid[i]=0x05050505;
OR
char *grid2 = (char*)grid;
for (int i=0; i<100*sizeof(int); i++)
grid2[i]=0x05;
It would print out 84215045
But in most C code, we want to initialize a piece of memory block to value zero.
char type --> \0 or NUL
int type --> 0
float type --> 0.0f
double type --> 0.0
pointer type --> nullptr
And either gcc or clang etc. modern compilers can take well care of this for you automatically.
// variadic length array (VLA) introduced in C99
int len = 20;
char carr[len];
int iarr[len];
float farr[len];
double darr[len];
memset(carr, 0, sizeof(char)*len);
memset(iarr, 0, sizeof(int)*len);
memset(farr, 0, sizeof(float)*len);
memset(darr, 0, sizeof(double)*len);
for (int i=0; i<len; i++)
{
printf("%2d: %c\n", i, carr[i]);
printf("%2d: %i\n", i, iarr[i]);
printf("%2d: %f\n", i, farr[i]);
printf("%2d: %lf\n", i, darr[i]);
}
But be aware, C ISO Committee does not imposed such definitions, it is compiler-specific.
Since the memset writes bytes,I usually use it to set an int array to zero like:
int a[100];
memset(a,0,sizeof(a));
or you can use it to set a char array,since a char is exactly a byte:
char a[100];
memset(a,'*',sizeof(a));
what's more,an int array can also be set to -1 by memset:
memset(a,-1,sizeof(a));
This is because -1 is 0xffffffff in int,and is 0xff in char(a byte).
This code has been tested. Here is a way to memset an "Integer" array to a value between 0 to 255.
MinColCost=new unsigned char[(Len+1) * sizeof(int)];
memset(MinColCost,0x5,(Len+1)*sizeof(int));
memset(MinColCost,0xff,(Len+1)*sizeof(int));