Hello, my issue is this; any advice will be gratefully accepted:
I have an array of structs (representing particles), but to simplify, say I start with an array containing only True values (Particle.exists = True). I am running my own CUDA kernel function on this array, and in some cases a True value is changed to False. After that I have to move this value to the end of the array as an optimization (no more work on dead particles (exists = False)).
Theoretically I have two options for how to do this:
some parallel sorting algorithm, or
moving each dead particle to the end and shifting the array.
The second option should be the better choice, but I don't know how to do it in parallel. I could have 1,000,000 particles, so shifting in one thread is not a good idea...
Here is an example of my code. I put a TODO in the part where I need to shift the array.
struct Particle
{
float2 position;
float angle;
bool exists;
};
__global__ void moveParticles(Particle* particles, const unsigned int lengthOfParticles, const Particle* leaders, const unsigned int lengthOfLeaders, const unsigned int sizeOfLeader, const float speedFactor, const cudaTextureObject_t heightMapTexture)
{
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
const unsigned int skip = gridDim.x * blockDim.x;
while (idx < lengthOfParticles)
{
// If particle does not exist then do nothing and skip
if (!particles[idx].exists) { idx += skip; continue; }
float bestLength = 3.40282e+038;
unsigned int bestLeaderIndex;
for (unsigned int i = 0; i < lengthOfLeaders; i++)
{
float currentLength = (
(particles[idx].position.x - leaders[i].position.x) * (particles[idx].position.x - leaders[i].position.x)
) + (
(particles[idx].position.y - leaders[i].position.y) * (particles[idx].position.y - leaders[i].position.y)
);
if (currentLength < bestLength)
{
bestLength = currentLength;
bestLeaderIndex = i;
}
}
Particle bestLeader = leaders[bestLeaderIndex];
float differenceX = bestLeader.position.x - particles[idx].position.x;
float differenceY = bestLeader.position.y - particles[idx].position.y;
float newLength = sqrtf(differenceX * differenceX + differenceY * differenceY);
// If newLength is within half the leader's size, the particle has reached the leader
// TODO: HERE I NEED SORT NOT EXISTING PARTICLE TO THE END
if (newLength <= sizeOfLeader / 2) { particles[idx].exists = false; idx += skip; continue; }
// Current height at the position
const uchar4 texelOfHeight = tex2D<uchar4>(heightMapTexture, particles[idx].position.x, particles[idx].position.y);
// Normalize vector
differenceX /= newLength;
differenceY /= newLength;
int nextPositionOnMapX = round(particles[idx].position.x + differenceX);
int nextPositionOnMapY = round(particles[idx].position.y + differenceY);
// Height of the next position
const uchar4 texelOfNextPosition = tex2D<uchar4>(heightMapTexture, nextPositionOnMapX, nextPositionOnMapY);
float differenceHeight = texelOfHeight.x - texelOfNextPosition.x;
float speed = sqrtf(speedFactor + 2 * fabsf(differenceHeight));
// Multiply by speed
differenceX *= speed;
differenceY *= speed;
particles[idx].position.x += differenceX;
particles[idx].position.y += differenceY;
idx += skip;
}
}
One possible solution I am thinking about is to write my own kernel function that will only shift the particles. Something like this:
__global__ void shiftParticles(const Particle* particles, const unsigned int lengthOfParticles, const unsigned int sizeOfParticle) {
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
const unsigned int skip = gridDim.x * blockDim.x;
//TODO: Shifting...
}
Sorting on GPUs is rather inefficient, so it is better to select the values to keep and perform a partition based on them. To do that easily, you can use CUB, which is quite efficient (it often implements the best state-of-the-art algorithm, or close to it).
You can use DevicePartition or two DeviceSelect calls (the former will likely be faster, unless you do not want to keep the dead particles at all). You could also use the block-level primitives if you want to perform some advanced tweaks/optimizations.
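For illustration, a minimal sketch of the CUB approach could look like the following (the helper name partitionParticles is made up; the two-phase temp-storage pattern follows CUB's documented usage, so treat this as an outline rather than a drop-in solution):
#include <cub/cub.cuh>

struct ParticleExists
{
    __host__ __device__ __forceinline__
    bool operator()(const Particle& p) const { return p.exists; }
};

// d_in and d_out are device arrays of lengthOfParticles particles;
// d_numAlive is a single device int receiving the number of living particles.
void partitionParticles(const Particle* d_in, Particle* d_out, int* d_numAlive, int lengthOfParticles)
{
    void* d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;
    // First call only queries how much temporary storage is needed.
    cub::DevicePartition::If(d_temp_storage, temp_storage_bytes, d_in, d_out, d_numAlive, lengthOfParticles, ParticleExists());
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    // Second call performs the partition: living particles first,
    // dead ones after them (in reverse order, as documented by CUB).
    cub::DevicePartition::If(d_temp_storage, temp_storage_bytes, d_in, d_out, d_numAlive, lengthOfParticles, ParticleExists());
    cudaFree(d_temp_storage);
}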
If you still want to do this yourself for some reason (e.g. reducing the number of dependencies in your project), then you can use atomic adds on relatively new devices, since they are very well optimized by the hardware. On older devices you could use scans to do that, but it is a bit harder to implement. The catch is that atomics do not scale particularly well when there are a lot of SMs, so you may need a more advanced blocking strategy. Here is an untested naive implementation to illustrate the idea:
// PS: what is the difference between sizeOfParticle and lengthOfParticles?
// pos must be initialized to 0 and contains the number of living particles (the pivot) once the kernel has finished executing.
__global__ void shiftParticles(const Particle* particles, const unsigned int lengthOfParticles, const unsigned int sizeOfParticle, Particle* outParticles, int* pos) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int skip = gridDim.x * blockDim.x;
    while (idx < lengthOfParticles) // Grid-stride loop so any grid size covers the whole array
    {
        const bool exists = particles[idx].exists;
        const int localPos = atomicAdd(pos, exists); // Here is the important point
        const Particle current = particles[idx];
        // outParticles is a needed temporary (or output) array,
        // as the operation cannot be efficiently performed in place in parallel.
        // It should likely be allocated and passed as an argument to the kernel.
        if (exists)
        {
            // Move the current particle towards the beginning
            outParticles[localPos] = current;
        }
        else
        {
            // Move the current particle towards the end
            outParticles[lengthOfParticles - 1 - idx + localPos] = current;
        }
        idx += skip;
    }
}
Note that the ordering is not preserved due to the atomic operations. If you need to keep the order of the particles, then it gets significantly more complicated, especially on GPUs, since it would make the algorithm more sequential. A naive solution could be to use a stable sort in that case. Another solution is to use a global scan followed by an indirection to store the values (so two passes). Implementing an efficient scan is a bit complex/tedious. Fortunately, CUB can help a lot in this case with its DeviceScan primitive.
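To sketch the scan-based approach (an untested outline; the two helper kernels and their names are assumptions, and the prefix sum itself can come from cub::DeviceScan::ExclusiveSum):
// Pass 1: mark living particles with 1, dead ones with 0.
__global__ void flagAlive(const Particle* in, int* flags, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = in[i].exists ? 1 : 0;
}

// offsets must hold the exclusive prefix sum of flags, and numAlive = offsets[n-1] + flags[n-1].
// Pass 2: scatter each particle to its final slot, preserving relative order.
__global__ void scatterParticles(const Particle* in, Particle* out, const int* flags, const int* offsets, unsigned int n, unsigned int numAlive)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (flags[i])
        out[offsets[i]] = in[i];                  // living particles keep their order at the front
    else
        out[numAlive + (i - offsets[i])] = in[i]; // dead particles keep their order at the back
}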
Finally, note that using an array of structures is not efficient, especially on hardware using SIMD instructions like GPUs. The implementation should be significantly faster with a structure of arrays (due to cache lines, coalescing, contiguity of the access pattern, etc.).
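For illustration, a possible structure-of-arrays layout (the field names mirror the Particle struct from the question; this only sketches the data layout, not the full refactoring):
struct Particles // structure of arrays: one device array per field
{
    float2* position; // lengthOfParticles positions
    float*  angle;    // lengthOfParticles angles
    bool*   exists;   // lengthOfParticles flags
};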
I am new to CUDA programming. I have a code that converts an RGB image to greyscale. The algorithm for reading the RGB values of a pixel and converting them to greyscale has been provided to us.
Parallelizing the code has given me around a 40-50x speedup. I want to optimize it further to achieve around a 100x speedup. For this purpose I want to use shared memory access, as it is an order of magnitude faster than global memory access. I have gone through different online resources and have a basic understanding of shared memory access, but in my code I am having trouble understanding how to implement it. The code to read the RGB values and convert them to greyscale:
for ( int y = 0; y < height; y++ ) {
for ( int x = 0; x < width; x++ ) {
float grayPix = 0.0f;
float r = static_cast< float >(inputImage[(y * width) + x]);
float g = static_cast< float >(inputImage[(width * height) + (y * width) + x]);
float b = static_cast< float >(inputImage[(2 * width * height) + (y * width) + x]);
grayPix = ((0.3f * r) + (0.59f * g) + (0.11f * b));
grayPix = (grayPix * 0.6f) + 0.5f;
darkGrayImage[(y * width) + x] = static_cast< unsigned char >(grayPix);
}
}
The input image is a char* and we are using the CImg library to manipulate the image:
CImg< unsigned char > inputImage = CImg< unsigned char >(argv[1]);
where the user passes the path to the image as an argument when running the code.
This is my CUDA implementation of it:
unsigned int y = (blockIdx.x * blockDim.x) + threadIdx.x;
unsigned int x = (blockIdx.y * blockDim.y) + threadIdx.y;
float grayPix = 0.0f;
float r = static_cast< float >(inputImage[(y * height) + x]);
float g = static_cast< float >(inputImage[(width * height) + (y * height) + x]);
float b = static_cast< float >(inputImage[(2 * width * height) + (y * height) + x]);
grayPix = ((0.3f * r) + (0.59f * g) + (0.11f * b));
grayPix = (grayPix * 0.6f) + 0.5f;
darkGrayImage[(y * height) + x] = static_cast< unsigned char >(grayPix);
The grid and block dimensions and the kernel call:
dim3 gridSize(width/16,height/16);
dim3 blockSize(16,16);
greyScale<<< gridSize, blockSize >>>(width,height,d_in, d_out);
where width and height are the width and height of the input image. I tried with a block size of (32,32), but it slowed down the code instead of speeding it up.
Now I want to add shared memory, but the problem is that the access to the input variable inputImage is quite non-linear, so what values do I put into shared memory?
I tried something like
unsigned int y = (blockIdx.x * blockDim.x) + threadIdx.x;
unsigned int x = (blockIdx.y * blockDim.y) + threadIdx.y;
extern __shared__ int s[];
s[x]=inputImage[x];
__syncthreads();
and then replacing inputImage with s in the implementation, but that just gave a wrong output (an all-black image).
Can you help me understand how I can implement shared memory, if it is even possible and useful here? And is there a way I can make my accesses more coalesced?
Any help would be greatly appreciated.
This can't work for several reasons:
unsigned int x = (blockIdx.y * blockDim.y) + threadIdx.y;
extern __shared__ int s[];
s[x]=inputImage[x];
One reason is that we cannot use a global index (x) as a shared memory index, unless the data set is small enough to fit in shared memory. For an image of reasonably large dimensions, you cannot fit the entire image into a single instance of shared memory. Furthermore, you are using only a one-dimensional index (x) for a two-dimensional data set, so this can't possibly make sense.
This suggests a general lack of understanding of how to use shared memory in a program. However, rather than trying to sort this out, we can observe that for a properly written RGB->grayscale code, shared memory usage is unlikely to provide any benefit.
Shared memory bandwidth benefits (which is what you are referring to when you say "magnitude faster") are valuable when there is data re-use. An RGB->grayscale code should not require any data re-use. You load each R,G,B quantity exactly once from global memory, and you store the computed grayscale quantity exactly once in global memory. Moving some of this data temporarily to shared memory is not going to speed anything up. You still have to do the global loads and global stores, and for a properly written code, this should be all that is necessary.
However in your question you've already suggested a possible improvement path: coalesced access. If you were to profile your posted code, you would find completely uncoalesced access patterns. For good coalescing, we want compound index calculations to have the property that the threadIdx.x variable is not multiplied by anything:
unsigned int y = (blockIdx.x * blockDim.x) + threadIdx.x;
unsigned int x = (blockIdx.y * blockDim.y) + threadIdx.y;
float grayPix = 0.0f;
float r = static_cast< float >(inputImage[(y * height) + x]);
^
|
y depends on threadIdx.x
But in your case, your index calculation is multiplying threadIdx.x by height. This will result in non-coalesced access. Adjacent threads in a warp will have varying threadIdx.x, and we want index calculations of adjacent threads in the warp to result in adjacent locations in memory, for good coalesced access. You cannot achieve this if you multiply threadIdx.x by anything.
The solution for this problem is quite simple. You should just use kernel code that is almost an exact duplicate of the non-CUDA code you have shown, with appropriate definitions for x and y:
unsigned int x = (blockIdx.x * blockDim.x) + threadIdx.x;
unsigned int y = (blockIdx.y * blockDim.y) + threadIdx.y;
if ((x < width) && (y < height)){
float grayPix = 0.0f;
float r = static_cast< float >(inputImage[(y * width) + x]);
float g = static_cast< float >(inputImage[(width * height) + (y * width) + x]);
float b = static_cast< float >(inputImage[(2 * width * height) + (y * width) + x]);
grayPix = ((0.3f * r) + (0.59f * g) + (0.11f * b));
grayPix = (grayPix * 0.6f) + 0.5f;
darkGrayImage[(y * width) + x] = static_cast< unsigned char >(grayPix);
}
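For reference, a matching launch configuration might look like this (just a sketch: the 16x16 block size and the names greyScale, d_in, and d_out come from your question; the ceil-division grid sizing is an assumption so that the bounds check above covers image sizes that are not multiples of 16):
dim3 blockSize(16, 16);
dim3 gridSize((width + blockSize.x - 1) / blockSize.x,
              (height + blockSize.y - 1) / blockSize.y);
greyScale<<< gridSize, blockSize >>>(width, height, d_in, d_out);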
Naturally, this is not a complete code. You have not shown a complete code, so if you respond with "I tried this but it doesn't work", it's unlikely I'll be able to help you much, since I don't know what code you're actually running. But:
Shared memory is not the right way to go for this algorithm
You undoubtedly have a coalescing issue in your posted code, for the reasons I indicate
The coalescing fix should follow the path I outlined
Your performance should improve with the coalescing fix.
Note that a response of "it doesn't work" means you are really asking for debugging assistance, not a conceptual explanation, in which case you are supposed to provide an MCVE. What you have shown is not an MCVE. Preferably your MCVE should not depend on an external library like CImg, which means it requires some effort on your part to create a standalone test that still demonstrates the problem you are having.
Also, I would suggest whenever you are having trouble with a CUDA code, to use proper CUDA error checking as well as run your code with cuda-memcheck.
(Proper CUDA error checking would have identified a problem with your attempt to use shared memory, for example, due to out-of-bounds indexing in shared memory.)
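For example, a common error-checking pattern looks something like this (a sketch; the macro name is arbitrary):
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// After a kernel launch:
// greyScale<<< gridSize, blockSize >>>(width, height, d_in, d_out);
// CUDA_CHECK(cudaGetLastError());      // reports launch/configuration errors
// CUDA_CHECK(cudaDeviceSynchronize()); // reports asynchronous execution errors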
I am trying to write some code in C++, but after some searching on the internet, I found an OpenCL-based code that does exactly what I want to do in C++. But since this is the first time I have seen OpenCL code, I don't know how to change the following functions into C++:
const __global float4 *in_buf;
int x = get_global_id(0);
int y = get_global_id(1);
float result = y * get_global_size(0);
Is 'const __global float4 *in_buf' equivalent to 'const float *in_buf' in C++? And how do I change the other functions above? Could anyone help? Thanks.
In general, you should take a look at the OpenCL specification (I'm assuming it's written in OpenCL 1.x) to better understand functions, types and how a kernel works.
Specifically for your question:
get_global_id returns the id of the current work item, and get_global_size returns the total number of work items. Since an OpenCL work-item is roughly equivalent to a single iteration in a sequential language, the equivalent of OpenCL's:
int x = get_global_id(0);
int y = get_global_id(1);
// do something with x and y
float result = y * get_global_size(0);
Will be C's:
for (int x = 0; x < dim0; x++) {
for (int y = 0; y < dim1; y++) {
// do something with x and y
float result = y * dim0;
}
}
As for float4, it's a vector type of 4 floats, roughly equivalent to C's float[4] (except that it supports many additional operators, such as vector arithmetic). Of course, in this case it's a buffer, so an appropriate type would be a pointer to such 4-float elements (float (*)[4]) - or better yet, just pack them together into a flat float* buffer and load 4 at a time.
Feel free to ignore the __global modifier.
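If you want to keep the 4-component grouping in plain C++, a minimal stand-in could look like this (just a sketch; the real OpenCL float4 also supports swizzling and vector arithmetic):
struct float4 { float x, y, z, w; }; // plain C++ stand-in for OpenCL's float4
const float4 *in_buf;                // the buffer is then an array of such elements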
const __global float4 *in_buf is not equivalent to const float *in_buf.
OpenCL uses vector types, e.g. floatN, where N is e.g. 2, 4 or 8. So float4 is in effect a struct { float x, y, z, w; } with a lot of tricks available to express vector operations.
get_global_id(0) gives you the iterator variable, so essentially replace every get_global_id(dim) with for(int x = 0; x < max[dim]; x++).
So this is a follow-up to a question I had. At the moment, in a CPU version of some code, I have many things that look like the following:
for(int i =0;i<N;i++){
dgemm(A[i], B[i],C[i], Size[i][0], Size[i][1], Size[i][2], Size[i][3], 'N','T');
}
where A[i] is a 2D matrix of some size.
I would like to be able to do this on a GPU using CULA (I'm not just doing multiplies, so I need the linear algebra operations in CULA), for example:
for(int i =0;i<N;i++){
status = culaDeviceDgemm('T', 'N', Size[i][0], Size[i][0], Size[i][0], alpha, GlobalMat_d[i], Size[i][0], NG_d[i], Size[i][0], beta, GG_d[i], Size[i][0]);
}
but I would like to store my B's on the GPU in advance at the start of the program, as they don't change, so I need a vector that contains pointers to the set of vectors that make up my B's.
I currently have the following code that compiles:
double **GlobalFVecs_d;
double **GlobalFPVecs_d;
extern "C" void copyFNFVecs_(double **FNFVecs, int numpulsars, int numcoeff){
cudaError_t err;
GlobalFPVecs_d = (double **)malloc(numpulsars * sizeof(double*));
err = cudaMalloc( (void ***)&GlobalFVecs_d, numpulsars*sizeof(double*) );
checkCudaError(err);
for(int i =0; i < numpulsars;i++){
err = cudaMalloc( (void **) &(GlobalFPVecs_d[i]), numcoeff*numcoeff*sizeof(double) );
checkCudaError(err);
err = cudaMemcpy( GlobalFPVecs_d[i], FNFVecs[i], sizeof(double)*numcoeff*numcoeff, cudaMemcpyHostToDevice );
checkCudaError(err);
}
err = cudaMemcpy( GlobalFVecs_d, GlobalFPVecs_d, sizeof(double*)*numpulsars, cudaMemcpyHostToDevice );
checkCudaError(err);
}
but if I now try to access it with:
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid;//((G + dimBlock.x - 1) / dimBlock.x,(N + dimBlock.y - 1) / dimBlock.y);
dimGrid.x=(numcoeff + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (numcoeff + dimBlock.y - 1)/dimBlock.y;
for(int i =0; i < numpulsars; i++){
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFVecs_d[i], numpulsars, numcoeff, i);
}
it seg faults here. Is this not how to get at the data?
The kernel function that I'm calling is just:
__global__ void CopyPPFNF(double *FNF_d, double *PPFNF_d, int numpulsars, int numcoeff, int thispulsar) {
// Each thread computes one element of C
// by accumulating results into Cvalue
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int subrow=row-thispulsar*numcoeff;
int subcol=row-thispulsar*numcoeff;
__syncthreads();
if(row >= (thispulsar+1)*numcoeff || col >= (thispulsar+1)*numcoeff) return;
if(row < thispulsar*numcoeff || col < thispulsar*numcoeff) return;
FNF_d[row * numpulsars*numcoeff + col] += PPFNF_d[subrow*numcoeff+subcol];
}
What am I not doing right? Note that eventually I would also like to do as in the first example, calling CULA functions on each GlobalFVecs_d[i], but for now not even this works.
Do you think this is the best way to go about doing this? If it were possible to just pass CULA functions a slice of a large contiguous vector, I could do that too, but I don't know if it supports that.
Cheers
Lindley
change this:
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFVecs_d[i], numpulsars, numcoeff, i);
to this:
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFPVecs_d[i], numpulsars, numcoeff, i);
and I believe it will work.
Your methodology of handling pointers is mostly correct. However, when you put GlobalFVecs_d[i] in the parameter list, you are forcing the kernel setup code (running on the host) to take GlobalFVecs_d (a device pointer, created with cudaMalloc), add an appropriately scaled i to the pointer value, and then dereference the resultant pointer to retrieve the value to pass as a parameter to the kernel. But we are not allowed to dereference device pointers in host code.
However, because your methodology was mostly correct, you have a convenient parallel array of the same pointers that resides on the host. This array (GlobalFPVecs_d) is something that we are allowed to dereference into, in host code, to retrieve the resultant device pointer, to pass to the kernel.
It's an interesting bug because normally kernels do not seg fault (although they may throw an error), so a seg fault on a kernel invocation line is unusual. But in this case, the seg fault is occurring in the kernel setup code, not the kernel itself.
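To make the distinction concrete, a small annotated sketch using the names from the question (illustrative only):
// GlobalFPVecs_d lives in host memory, so the host may index it:
double* d_matrix = GlobalFPVecs_d[i];   // OK: yields a device pointer to pass to the kernel
// GlobalFVecs_d was allocated with cudaMalloc, so only device code may dereference it:
// double* bad = GlobalFVecs_d[i];      // NOT OK in host code
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, d_matrix, numpulsars, numcoeff, i);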
I need to create a 2D int array of size 800x800. But doing so creates a stack overflow (ha ha).
I'm new to C++, so should I do something like a vector of vectors? And just encapsulate the 2d array into a class?
Specifically, this array is my zbuffer in a graphics program. I need to store a z value for every pixel on the screen (hence the large size of 800x800).
Thanks!
You need about 2.5 megs, so just using the heap should be fine. You don't need a vector unless you need to resize it. See C++ FAQ Lite for an example of using a "2D" heap array.
int *array = new int[800*800];
(Don't forget to delete[] it when you're done.)
Every post so far leaves the memory management to the programmer. This can and should be avoided. ReaperUnreal is darn close to what I'd do, except I'd use a vector rather than an array, make the dimensions template parameters, and change the access functions -- and oh, just IMNSHO, clean things up a bit:
template <class T, size_t W, size_t H>
class Array2D
{
public:
static const int width = W;
static const int height = H;
typedef T type;
Array2D()
: buffer(width*height)
{
}
inline type& at(unsigned int x, unsigned int y)
{
return buffer[y*width + x];
}
inline const type& at(unsigned int x, unsigned int y) const
{
return buffer[y*width + x];
}
private:
std::vector<T> buffer;
};
Now you can declare this 2-D array object on the stack just fine (the vector keeps the actual buffer on the heap, so it will not overflow the stack):
void foo()
{
Array2D<int, 800, 800> zbuffer;
// Do something with zbuffer...
}
I hope this helps!
EDIT: Removed array specification from Array2D::buffer. Thanks to Andreas for catching that!
Kevin's example is good, however:
std::vector<T> buffer[width * height];
Should be
std::vector<T> buffer;
Expanding it a bit you could of course add operator-overloads instead of the at()-functions:
const T &operator()(int x, int y) const
{
return buffer[y * width + x];
}
and
T &operator()(int x, int y)
{
return buffer[y * width + x];
}
Example:
int main()
{
Array2D<int, 800, 800> a;
a(10, 10) = 50;
std::cout << "A(10, 10)=" << a(10, 10) << std::endl;
return 0;
}
You could do a vector of vectors, but that would have some overhead. For a z-buffer the more typical method would be to create an array of size 800*800=640000.
const int width = 800;
const int height = 800;
unsigned int* z_buffer = new unsigned int[width*height];
Then access the pixels as follows:
unsigned int z = z_buffer[y*width+x];
I might create a single dimension array of 800*800. It is probably more efficient to use a single allocation like this, rather than allocating 800 separate vectors.
int *ary=new int[800*800];
Then, probably encapsulate that in a class that acted like a 2D array.
class _2DArray
{
public:
    int *operator[](const size_t &idx)
    {
        return &ary[idx*800];
    }
    const int *operator[](const size_t &idx) const
    {
        return &ary[idx*800];
    }
private:
    int *ary; // points at the single 800*800 allocation shown above
};
The abstraction shown here has a lot of holes, e.g., what happens if you access out past the end of a "row"? The book "Effective C++" has a pretty good discussion of writing good multi-dimensional arrays in C++.
One thing you can do is change the stack size (if you really want the array on the stack). With VC the flag to do this is /F (http://msdn.microsoft.com/en-us/library/tdkhxaks(VS.80).aspx).
But the solution you probably want is to put the memory on the heap rather than on the stack; for that you should use a vector of vectors.
The following line declares a vector of 800 elements, where each element is a vector of 800 ints, and saves you from managing the memory manually:
std::vector<std::vector<int> > arr(800, std::vector<int>(800));
Note the space between the two closing angle brackets (> >), which is required in order to disambiguate them from the shift-right operator (this will no longer be needed in C++0x).
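Access then looks like this (a tiny illustration; depth is just a placeholder value):
arr[y][x] = depth; // the outer index selects the row vector, the inner index the column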
Or you could try something like:
boost::shared_array<int> zbuffer(new int[width*height]);
You should still be able to do this too:
++zbuffer[0];
No more worries about managing the memory, no custom classes to take care of, and it's easy to throw around.
There's the C like way of doing:
const int xwidth = 800;
const int ywidth = 800;
int* array = new int[xwidth * ywidth];
// Note: plain new throws std::bad_alloc on failure rather than returning NULL,
// so use a try/catch (or new(std::nothrow)) if you want to handle allocation errors
// Then do stuff with the array, such as zero initialize it
for(int x = 0; x < xwidth; ++x)
{
for(int y = 0; y < ywidth; ++y)
{
array[y * xwidth + x] = 0;
}
}
// Just use array[y * xwidth + x] when you want to access your array.
// When you're done with it, free the memory you allocated with
delete[] array;
You could encapsulate the y * xwidth + x inside a class with easy get and set methods (possibly overloading the [] operator if you want to start getting into more advanced C++). I'd recommend getting to that slowly, though, if you're just starting with C++, rather than starting off by writing re-usable, fully templated classes for n-dimensional arrays, which will just confuse you when you're starting out.
As soon as you get into graphics work you might find that the overhead of the extra class calls slows down your code. However, don't worry about this until your application isn't fast enough and you can profile it to show where the time is lost, rather than making it more difficult to use at the start with possibly unnecessary complexity.
I found that the C++ FAQ Lite was great for information such as this. In particular your question is answered by:
http://www.parashift.com/c++-faq-lite/freestore-mgmt.html#faq-16.16
You can allocate the array in static storage (at file scope, or with the static qualifier in function scope), if you need only one instance:
int array[800][800];
void fn()
{
static int array[800][800];
}
This way it will not go on the stack, and you do not have to deal with dynamic memory.
Well, building on what Niall Ryan started, if performance is an issue, you can take this one step further by optimizing the math and encapsulating this into a class.
So we'll start with a bit of math. Recall that 800 can be written in powers of 2 as:
800 = 512 + 256 + 32 = 2^9 + 2^8 + 2^5
So we can write our addressing function as:
int index = (y << 9) + (y << 8) + (y << 5) + x; // parentheses are required: + binds tighter than <<
So if we encapsulate everything into a nice class we get:
class ZBuffer
{
public:
    static const unsigned int width = 800;
    static const unsigned int height = 800;
    ZBuffer()
    {
        for(unsigned int i = 0, *pBuff = zbuff; i < width * height; i++, pBuff++)
            *pBuff = 0;
    }
    inline unsigned int getZAt(unsigned int x, unsigned int y)
    {
        return *(zbuff + (y << 9) + (y << 8) + (y << 5) + x);
    }
    inline void setZAt(unsigned int x, unsigned int y, unsigned int z)
    {
        *(zbuff + (y << 9) + (y << 8) + (y << 5) + x) = z;
    }
private:
    unsigned int zbuff[width * height];
};
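A quick usage sketch (the pixel coordinates and the depth value are arbitrary placeholders):
// Because zbuff is a plain array member (~2.5 MB), an automatic (stack) instance would hit
// the same stack overflow the question describes, so give it static or heap storage:
static ZBuffer zb;
zb.setZAt(10, 10, 50);              // store a depth value at pixel (10, 10)
unsigned int z = zb.getZAt(10, 10); // read it back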