How to copy data from unsigned int to ulong4 in CUDA - c++

.h file:
#define VECTOR_SIZE 1024
.cpp file:
int main ()
{
    unsigned int* A;
    A = new unsigned int [VECTOR_SIZE];
    CopyToDevice(A);
}
.cu file:
void CopyToDevice (unsigned int *A)
{
    ulong4 *UA;
    unsigned int VectorSizeUlong4 = VECTOR_SIZE / 4;
    unsigned int VectorSizeBytesUlong4 = VectorSizeUlong4 * sizeof(ulong4);
    cudaMalloc( (void**)&UA, VectorSizeBytesUlong4 );
    // How can I use cudaMemcpy to copy data from A to UA?
    // I tried the following, but it gave an access violation error:
    for (int i=0; i<VectorSizeUlong4; ++i)
    {
        UA[i].x = A[i*4 + 0];
        UA[i].y = A[i*4 + 1];
        UA[i].z = A[i*4 + 2];
        UA[i].w = A[i*4 + 3];
    }
    // I also tried copying A to the device first and working on it there, instead of
    // going back to the CPU to access A every time, but that did not work either.
}

The CUDA ulong4 is a 16-byte-aligned structure defined as
struct __builtin_align__(16) ulong4
{
    unsigned long int x, y, z, w;
};
On a platform where unsigned long is 32 bits (so that sizeof(ulong4) == 16), four consecutive 32-bit unsigned source integers occupy exactly the same space as one ulong4. The simplest solution is contained right in the text on the image you posted - just cast (either implicitly or explicitly) the unsigned int pointer to a ulong4 pointer, use cudaMemcpy directly on the host and device memory, and pass the resulting device pointer to whatever kernel function requires a ulong4 input. Your device transfer function could look something like:
ulong4* CopyToDevice (unsigned int* A)
{
    ulong4 *UA, *UA_h;
    size_t VectorSizeUlong4 = VECTOR_SIZE / 4;
    size_t VectorSizeBytesUlong4 = VectorSizeUlong4 * sizeof(ulong4);
    cudaMalloc( (void**)&UA, VectorSizeBytesUlong4 );
    UA_h = reinterpret_cast<ulong4*>(A); // not necessary, but increases transparency
    cudaMemcpy(UA, UA_h, VectorSizeBytesUlong4, cudaMemcpyHostToDevice);
    return UA;
}
[Usual disclaimer: written in browser, not tested or compiled, use at own risk]

This should raise all alarm bells:
cudaMalloc( (void**)&UA, VectorSizeBytesUlong4 );
// ...
UA[i].x = A[i*4 + 0];
You are allocating UA on the device and then dereferencing it in host code. Don't ever do that. You need to use cudaMemcpy to copy arrays to the device. This tutorial shows a basic program that uses cudaMemcpy to copy data over. The length argument to cudaMemcpy is the length of your array in bytes, which in your case is VECTOR_SIZE * sizeof(unsigned int).
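A minimal sketch of that copy, assuming the VECTOR_SIZE macro from the question and the 32-bit unsigned long caveat from the answer above (A_d is a hypothetical name; error checking omitted):
unsigned int *A_d;                                   // device copy of A
size_t bytes = VECTOR_SIZE * sizeof(unsigned int);   // length in bytes, as noted above
cudaMalloc((void**)&A_d, bytes);
cudaMemcpy(A_d, A, bytes, cudaMemcpyHostToDevice);   // host -> device, performed by host code
ulong4 *UA = reinterpret_cast<ulong4*>(A_d);         // view the same device bytes as ulong4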

Related

SIGSEGV in CUDA allocation

I have a host array of uint64_t of size spectrum_size and I need to allocate it on my GPU and copy it there.
But when I try to allocate it in GPU memory, I keep receiving SIGSEGV... Any ideas?
uint64_t * gpu_hashed_spectrum;
uint64_t * gpu_hashed_spectrum_h = new uint64_t [spectrum_size];
HANDLE_ERROR(cudaMalloc((void **)&gpu_hashed_spectrum, sizeof(uint64_t *) * spectrum_size));
for(i=0; i<spectrum_size; i++) {
    HANDLE_ERROR(cudaMalloc((void **)&gpu_hashed_spectrum_h[i], sizeof(uint64_t)));
}
printf("\t\t...Copying\n");
for(i=0; i<spectrum_size; i++) {
    HANDLE_ERROR(cudaMemcpy((void *)gpu_hashed_spectrum_h[i], (const void *)hashed_spectrum[i], sizeof(uint64_t), cudaMemcpyHostToDevice));
}
HANDLE_ERROR(cudaMemcpy(gpu_hashed_spectrum, gpu_hashed_spectrum_h, spectrum_size * sizeof(uint64_t *), cudaMemcpyHostToDevice));
Full code available here
UPDATE:
I tried to do it this way; now I get SIGSEGV in other parts of the code (in the kernel, when using this array). Maybe that is due to other errors.
uint64_t * gpu_hashed_spectrum;
HANDLE_ERROR(cudaMalloc((void **)&gpu_hashed_spectrum, sizeof(uint64_t) * spectrum_size));
HANDLE_ERROR(cudaMemcpy(gpu_hashed_spectrum, hashed_spectrum, spectrum_size * sizeof(uint64_t), cudaMemcpyHostToDevice));
At the least, you are confusing uint64_t** with uint64_t*.
In the first line you define gpu_hashed_spectrum as a pointer to data of type uint64_t, but in the third line
HANDLE_ERROR(cudaMalloc((void **)&gpu_hashed_spectrum, sizeof(uint64_t *) * spectrum_size));
you use gpu_hashed_spectrum as a pointer to data of type uint64_t*.
Maybe you should change your definition to
uint64_t** gpu_hashed_spectrum;
as well as some other lines.
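For illustration, a sketch of what the consistently-typed per-element version could look like (assuming hashed_spectrum is the host uint64_t array from the question; a single flat allocation, as in your update, is usually simpler and faster):
uint64_t** gpu_hashed_spectrum;                                   // device array of device pointers
uint64_t** gpu_hashed_spectrum_h = new uint64_t*[spectrum_size];  // host staging array of device pointers
for (size_t i = 0; i < spectrum_size; i++) {
    // one device uint64_t per element; its device pointer is kept on the host for now
    HANDLE_ERROR(cudaMalloc((void **)&gpu_hashed_spectrum_h[i], sizeof(uint64_t)));
    HANDLE_ERROR(cudaMemcpy(gpu_hashed_spectrum_h[i], &hashed_spectrum[i],
                            sizeof(uint64_t), cudaMemcpyHostToDevice));
}
// now copy the array of device pointers itself to the device
HANDLE_ERROR(cudaMalloc((void **)&gpu_hashed_spectrum, spectrum_size * sizeof(uint64_t*)));
HANDLE_ERROR(cudaMemcpy(gpu_hashed_spectrum, gpu_hashed_spectrum_h,
                        spectrum_size * sizeof(uint64_t*), cudaMemcpyHostToDevice));
delete[] gpu_hashed_spectrum_h;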

Getting wrong data back when reading from binary file

I'm having an issue reading in some bytes from a YUV file (it's 1280x720, if that matters) and was hoping someone could point out what I'm doing wrong. I'm getting different results using the read command and using an istream_iterator. Here's some example code of what I'm trying to do:
void readBlock(std::ifstream& yuvFile, YUVBlock& destBlock, YUVConfig& config, const unsigned int x, const unsigned int y, const bool useAligned = false)
{
    //Calculate luma offset
    unsigned int YOffset = (useAligned ? config.m_alignedYFileOffset : config.m_YFileOffset) +
        (destBlock.yY * (useAligned ? config.m_alignedYUVWidth : config.m_YUVWidth) + destBlock.yX);// *config.m_bitDepth;
    //Copy Luma data
    //yuvFile.seekg(YOffset, std::istream::beg);
    for (unsigned int lumaY = 0; lumaY < destBlock.m_YHeight && ((lumaY + destBlock.yY) < config.m_YUVHeight); ++lumaY)
    {
        yuvFile.seekg(YOffset + ((useAligned ? config.m_alignedYUVWidth : config.m_YUVWidth)/* * config.m_bitDepth*/) * (lumaY), std::istream::beg);
        int copySize = destBlock.m_YWidth;
        if (destBlock.yX + copySize > config.m_YUVWidth)
        {
            copySize = config.m_YUVWidth - destBlock.yX;
        }
        if (destBlock.yX >= 1088 && destBlock.yY >= 704)
        {
            char* test = new char[9];
            yuvFile.read(test, 9);
            delete[] test;
            yuvFile.seekg(YOffset + ((useAligned ? config.m_alignedYUVWidth : config.m_YUVWidth)/* * config.m_bitDepth*/) * (lumaY));
        }
        std::istream_iterator<uint8_t> start = std::istream_iterator<uint8_t>(yuvFile);
        std::copy_n(start, copySize, std::back_inserter(destBlock.m_yData));
    }
}
struct YUVBlock
{
    std::vector<uint8_t> m_yData;
    std::vector<uint8_t> m_uData;
    std::vector<uint8_t> m_vData;
    unsigned int m_YWidth;
    unsigned int m_YHeight;
    unsigned int m_UWidth;
    unsigned int m_UHeight;
    unsigned int m_VWidth;
    unsigned int m_VHeight;
    unsigned int yX;
    unsigned int yY;
    unsigned int uX;
    unsigned int uY;
    unsigned int vX;
    unsigned int vY;
};
This error only seems to happen at X = 1088 and Y = 704 in the image. I'm expecting to see a byte value of 10 as the first byte I read back. When I use
yuvFile.read(test, 9);
I get 10 as my first byte. When I use the istream iterator:
std::istream_iterator<uint8_t> start = std::istream_iterator<uint8_t>(yuvFile);
std::copy_n(start, copySize, std::back_inserter(destBlock.m_yData));
The first byte I read is 17. 17 is the byte after 10, so it seems the istream_iterator skips the first byte.
Any help would be appreciated
There is a major difference between istream::read and std::istream_iterator.
std::istream::read performs an unformatted read.
std::istream_iterator performs a formatted read.
From http://en.cppreference.com/w/cpp/iterator/istream_iterator
std::istream_iterator is a single-pass input iterator that reads successive objects of type T from the std::basic_istream object for which it was constructed, by calling the appropriate operator>>.
Formatted extraction skips whitespace by default, and byte value 10 is '\n', which is classified as whitespace - that is exactly why your first byte disappears.
If your file was created using std::ostream::write or fwrite, you must use std::istream::read or fread to read the data back.
If your file was created using one of the methods that produce formatted output, such as std::ostream::operator<<() or fprintf, you can read the data back with std::istream_iterator.
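If you want to keep the iterator style for raw bytes anyway, one option (a sketch, not from the original answer) is std::istreambuf_iterator, which reads straight from the stream buffer and never skips whitespace:
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <vector>

int main()
{
    std::ifstream yuvFile("test.yuv", std::ios::binary); // hypothetical file name
    std::size_t copySize = 9;                            // number of raw bytes to read
    std::vector<uint8_t> data;
    // istreambuf_iterator performs unformatted reads: byte 10 comes through intact
    std::istreambuf_iterator<char> start(yuvFile);
    std::copy_n(start, copySize, std::back_inserter(data));
}
Alternatively, applying yuvFile >> std::noskipws; before constructing the std::istream_iterator disables whitespace skipping for formatted reads.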

CUDA kernel function output variable isn't modified

I am trying to pass an object to a kernel. This object has basically two variables: one acts as the input and the other as the output of the kernel. But when I launch the kernel, the output variable does not change. However, when I add another variable to the kernel and assign the output value to this variable as well, it suddenly works for both of them.
I've read in another thread (While loop fails in CUDA kernel) that the compiler can evaluate a kernel as empty for optimization purposes if it doesn't produce any output.
So is it possible that this input/output object I'm passing as the only kernel argument somehow isn't recognized by the compiler as an output? And if that's true, is there an elegant way, such as a compiler option, to prevent this? (I would like to avoid adding another kernel argument.)
This is the class for this object.
class Replica
{
public :
    signed char gA[1024];
    int MA;
    __device__ __host__ Replica(){
    }
};
And this is the kernel that is basically a sum reduction.
__global__ void sumKerA (Replica* Rd)
{
    int t = threadIdx.x;
    int b = blockIdx.x;
    __shared__ signed short gAs[1024];
    gAs[t] = Rd[b].gA[t];
    for (unsigned int stride = 1024 >> 1; stride > 0; stride >>= 1){
        __syncthreads();
        if (t < stride){
            gAs[t] += gAs[t + stride];
        }
    }
    __syncthreads();
    if (t == 0){
        Rd[b].MA = gAs[0];
    }
}
And finally my host code.
int main ()
{
    // replicas - array of objects
    Replica R[128];
    for (int i = 0; i < 128; ++i){
        for (int j = 0; j < 1024; ++j){
            R[i].gA[j] = 2*(rand() % 2) - 1;
        }
        R[i].MA = 0;
    }
    Replica* Rd;
    cudaSetDevice(0);
    cudaMalloc((void **)&Rd,128*sizeof(Replica));
    cudaMemcpy(Rd,R,128*sizeof(Replica),cudaMemcpyHostToDevice);
    dim3 DimBlock(1024,1,1);
    dim3 DimGridA(128,1,1);
    sumKerA <<< DimBlock, DimGridA >>> (Rd);
    cudaThreadSynchronize();
    cudaMemcpy(&R,Rd,128*sizeof(Replica),cudaMemcpyDeviceToHost);
    // cudaMemcpy(&M,Md,128*sizeof(int),cudaMemcpyDeviceToHost);
    for (int i = 0; i < 128; ++i){
        cout << R[i].MA << " ";
    }
    cudaFree(Rd);
    return 0;
}
Based on your reduction code, it appears that you intend to launch 1024 threads per block.
In that case, this is incorrect:
dim3 DimBlock(1024,1,1);
dim3 DimGridA(128,1,1);
sumKerA <<< DimBlock, DimGridA >>> (Rd);
The first kernel configuration parameter is the dimension of the grid; the second is the dimension of the threadblock. If you want 1024 threads per block while launching 128 blocks, your kernel launch should look like this:
sumKerA <<< DimGridA, DimBlock >>> (Rd);
If you add proper CUDA error checking to your code, I expect you would see a kernel execution failure in your original case, because using the block variable (blockIdx.x) to index into the 128-element Rd array would index beyond the end of the array.
If you modify the Replica objects pointed to by Rd in your kernel, that is externally visible state, so any code that modifies those objects cannot be "optimized away" by the compiler.
Also note that cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize() (they have the same behavior).
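A sketch of the corrected sequence with basic error checking (using the names from the question; the checking pattern is generic, not from the original post):
dim3 DimBlock(1024,1,1);                 // threads per block
dim3 DimGridA(128,1,1);                  // blocks in the grid, one per Replica
sumKerA <<< DimGridA, DimBlock >>> (Rd); // grid dimension first, block dimension second
cudaError_t err = cudaGetLastError();    // catches configuration/launch errors
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();           // waits for the kernel and surfaces execution errors
if (err != cudaSuccess)
    printf("execution error: %s\n", cudaGetErrorString(err));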

C++ slow read/seekg

In my program I read in a file (for now only a test file of about 200k data points; later there will be millions). What I do is:
for (int i=0;i<n;i++) {
    fid.seekg(4,ios_base::cur);
    fid.read((char*) &x[i],8);
    fid.seekg(8,ios_base::cur);
    fid.read((char*) &y[i],8);
    fid.seekg(8,ios_base::cur);
    fid.read((char*) &z[i],8);
    fid.read((char*) &d[i],8);
    d[i] = (d[i] - p)/p;
    z[i] *= cc;
}
where n denotes the number of points to read in.
Afterwards I write them again with
for(int i=0;i<n;i++){
    fid.write((char*) &d[i],8);
    fid.write((char*) &z[i],8);
    temp = (d[i] + 1) * p;
    fid.write((char*) &temp,8);
}
The writing is faster than the reading (time measured with clock_t).
My question now is: have I made some rather stupid mistake with the reading, or can this behavior be expected?
I'm using Win XP with a magnetic drive.
yours magu_
You're using seekg too often. I see that you're using it to skip bytes, but you could just as well read the complete buffer and then skip the bytes in the buffer:
char buffer[52];
for (int i=0;i<n;i++) {
    fid.read(buffer, sizeof(buffer));
    memcpy(&x[i], &buffer[4], sizeof(x[i]));
    memcpy(&y[i], &buffer[20], sizeof(y[i]));
    // etc
}
However, you can define a struct that represents the data in your file:
#pragma pack(push, 1)
struct Item
{
    char dummy1[4]; // skip 4 bytes
    __int64 x;
    char dummy2[8]; // skip 8 bytes
    __int64 y;
    char dummy3[8]; // skip 8 bytes
    __int64 z;
    __int64 d;
};
#pragma pack(pop)
then declare an array of those structs and read all data at once:
Item* items = new Item[n];
fid.read(reinterpret_cast<char*>(items), n * sizeof(Item)); // reading all data at once is amazingly fast
(remark: I don't know the types of x, y, z and d, so I assume __int64 here)
I personally would (at least) do this:
for (int i=0;i<n;i++) {
    char dummy[8];
    fid.read(dummy,4);
    fid.read((char*) &x[i],8);
    fid.read(dummy,8);
    fid.read((char*) &y[i],8);
    fid.read(dummy,8);
    fid.read((char*) &z[i],8);
    fid.read((char*) &d[i],8);
    d[i] = (d[i] - p)/p;
    z[i] *= cc;
}
Using a struct, or reading large amounts of data in one go (say, adding a second layer where you read 4 KB at a time and then use a pair of functions that "skip" and "fetch" the different fields, as sketched below), would be a bit more work but likely much faster.
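A rough sketch of that second layer (a hypothetical Reader wrapper, not from the original answer; it assumes well-formed input and does not handle short reads at end of file):
#include <cstring>
#include <fstream>

// Hypothetical buffered reader: refills a 4 KB buffer and hands out fields from it.
struct Reader {
    std::ifstream& fid;
    char buf[4096];
    std::size_t pos = 0, avail = 0;

    void require(std::size_t need) {
        if (avail - pos < need) {
            std::size_t left = avail - pos;
            std::memmove(buf, buf + pos, left);        // keep the unconsumed tail
            fid.read(buf + left, sizeof(buf) - left);  // top the buffer up from the file
            avail = left + static_cast<std::size_t>(fid.gcount());
            pos = 0;
        }
    }
    void skip(std::size_t n)             { require(n); pos += n; }
    void fetch(void* dst, std::size_t n) { require(n); std::memcpy(dst, buf + pos, n); pos += n; }
};
With Reader r{fid}; the loop body becomes r.skip(4); r.fetch(&x[i], 8); r.skip(8); r.fetch(&y[i], 8); and so on, issuing one real read per 4 KB instead of several per record.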
Another option is to use mmap on Linux or MapViewOfFile on Windows. This reduces the overhead of reading a file somewhat, since one less copy is required to transfer the data to the application.
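A minimal POSIX sketch of that approach (Linux-only; the file name is hypothetical, and the record layout and the n, x, y, z, d variables are taken from the discussion above):
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int fd = open("data.bin", O_RDONLY);   // hypothetical file name
struct stat st;
fstat(fd, &st);
// Map the whole file read-only; pages are faulted in on demand, no read() copies.
const char* base = static_cast<const char*>(
    mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
for (int i = 0; i < n; i++) {
    const char* rec = base + i * 52;   // 52-byte records, same layout as the struct above
    std::memcpy(&x[i], rec + 4, 8);    // skip 4 bytes, then x
    std::memcpy(&y[i], rec + 20, 8);   // skip 8 bytes, then y
    std::memcpy(&z[i], rec + 36, 8);   // skip 8 bytes, then z
    std::memcpy(&d[i], rec + 44, 8);   // d follows immediately
}
munmap(const_cast<char*>(base), st.st_size);
close(fd);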
Edit: I should add "Make sure you make comparative measurements", and if your application is meant to run on many machines, make sure you make measurements on more than one type of machine, with different alternatives of disk drive, processor and memory. You don't really want to tweak the code so that it runs 50% faster on your machine, but 25% slower on another machine.
The assert() statements are the most important part of this code: if your platform ever changes and the widths of your native types change, the assertions will fail. Instead of seeking, I would read into a dummy area. The p* variables make the code easier to read, IMO.
assert(sizeof x[0] == 8);
assert(sizeof y[0] == 8);
assert(sizeof z[0] == 8);
assert(sizeof d[0] == 8);
for (int i=0;i<n;i++) {
    char unused[8];
    char * px = (char *) &x[i];
    char * py = (char *) &y[i];
    char * pz = (char *) &z[i];
    char * pd = (char *) &d[i];
    fid.read(unused, 4);
    fid.read(px, 8);
    fid.read(unused, 8);
    fid.read(py, 8);
    fid.read(unused, 8);
    fid.read(pz, 8);
    fid.read(pd, 8);
    d[i] = (d[i] - p)/p;
    z[i] *= cc;
}

While loop fails in CUDA kernel

I am using the GPU to do some calculation for processing words.
Initially, I used one block (with 500 threads) to process one word.
To process 100 words, I have to call the kernel function 100 times in a loop in my main function:
for (int i=0; i<100; i++)
    kernel <<< 1, 500 >>> (length_of_word);
My kernel function looks like this:
__global__ void kernel (int *dev_length)
{
    int length = *dev_length;
    while (length > 4)
    {   //do something;
        length -=4;
    }
}
Now I want to process all 100 words at the same time.
Each block will still have 500 threads, and processes one word (per block).
dev_totalwordarray: stores all characters of the words (one after another)
dev_length_array: stores the length of each word.
dev_accu_length: stores the accumulated length of the words (total chars of all previous words)
dev_salt_: an array of size 500, storing unsigned integers.
Hence, in my main function I have
kernel2 <<< 100, 500 >>> (dev_totalwordarray, dev_length_array, dev_accu_length, dev_salt_);
To populate the CPU array:
for (int i=0; i<wordnumber; i++)
{
    int length=0;
    while (word_list_ptr_array[i][length]!=0)
    {
        length++;
    }
    actualwordlength2[i] = length;
}
To copy from CPU -> GPU:
int* dev_array_of_word_length;
HANDLE_ERROR( cudaMalloc( (void**)&dev_array_of_word_length, 100 * sizeof(int) ) );
HANDLE_ERROR( cudaMemcpy( dev_array_of_word_length, actualwordlength2, 100 * sizeof(int),
                          cudaMemcpyHostToDevice ) );
My kernel function now looks like this:
__global__ void kernel2 (char* dev_totalwordarray, int *dev_length_array, int* dev_accu_length, unsigned int* dev_salt_)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int hash[N];
    int length = dev_length_array[blockIdx.x];
    while (tid < 50000)
    {
        const char* itr = &(dev_totalwordarray[dev_accu_length[blockIdx.x]]);
        hash[tid] = dev_salt_[threadIdx.x];
        unsigned int loop = 0;
        while (length > 4)
        {
            const unsigned int& i1 = *(reinterpret_cast<const unsigned int*>(itr)); itr += sizeof(unsigned int);
            const unsigned int& i2 = *(reinterpret_cast<const unsigned int*>(itr)); itr += sizeof(unsigned int);
            hash[tid] ^= (hash[tid] << 7) ^ i1 * (hash[tid] >> 3) ^ (~((hash[tid] << 11) + (i2 ^ (hash[tid] >> 5))));
            length -=4;
        }
        tid += blockDim.x * gridDim.x;
    }
}
However, kernel2 doesn't seem to work at all.
It seems while (length > 4) causes this.
Does anyone know why? Thanks.
I am not sure if the while is the culprit, but I see a few things in your code that worry me:
Your kernel produces no output. The optimizer will most likely detect this and convert it to an empty kernel.
You almost never want arrays allocated per thread. That consumes a lot of memory. Your hash[N] table will be allocated per thread and discarded at the end of the kernel. If N is big (and then multiplied by the total number of threads) you may run out of GPU memory. Not to mention that accessing hash will be almost as slow as accessing global memory.
All threads in a block will have the same itr value. Is that intended?
Every thread initializes only a single field within its own copy of the hash table.
I see hash[tid] where tid is a global index. Be aware that even if hash were made global, you might hit concurrency problems. Not all blocks within a grid run at the same time. While one block initializes a portion of hash, another block might not even have started!
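Regarding the first point, a sketch (with a hypothetical dev_out parameter, not from the original post) of how to make the result externally visible so the optimizer cannot discard the work:
// Hypothetical variant: the kernel writes one result per block to global memory.
__global__ void kernel2_out (char* dev_totalwordarray, int* dev_length_array,
                             int* dev_accu_length, unsigned int* dev_salt_,
                             unsigned int* dev_out)
{
    unsigned int h = dev_salt_[threadIdx.x];
    // ... hashing work as in the question ...
    if (threadIdx.x == 0)
        dev_out[blockIdx.x] = h;   // a global-memory write the compiler must keep
}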