I'm passing 3 arrays, each of size N=224, to a kernel. The kernel calls a function foo(threadIdx.x), and foo calls another function bar(i), where i goes from 1 to 224. The second function needs to access the arrays passed to the kernel, but with the code I have now the compiler tells me that the argument i is undefined.
I tried to save a copy of the arrays into shared memory, but it didn't work:
__global__ void dummy(double *pos_x_d, double *pos_y_d, double *hist_d){
    int i = threadIdx.x;
    hist_d[i] = pos_x_d[i] + pos_y_d[i];
    __syncthreads();
    foo(i);
    __syncthreads();
}
The host code looks like:
cudaMalloc((void **) &pos_x_d,(N*sizeof(double)));
cudaMalloc((void **) &pos_y_d,(N*sizeof(double)));
cudaMalloc((void **) &hist_d,(N*sizeof(double)));
//Copy data to GPU
cudaMemcpy((void *)pos_x_d, (void*)pos_x_h,N*sizeof(double),cudaMemcpyHostToDevice);
cudaMemcpy((void *)pos_y_d, (void*)pos_y_h,N*sizeof(double),cudaMemcpyHostToDevice);
//Launch Kernel
dummy<<<1,224>>>(pos_x_d,pos_y_d,hist_d);
Is it possible to launch two kernels, the first to load the data into shared memory and the second to do the calculations? I also need to loop over the second kernel, which is why I wanted to put the data in shared memory in the first place. The error comes from lines 89 and 90, which suggests it has to do with the shared memory. The complete code is here.
Is it possible to launch two kernels, the first to load the data into shared memory and the second to do the calculations?
No, it's not possible. The lifetime of shared memory is the threadblock associated with that shared memory. A threadblock cannot reliably use the values stored by a different threadblock (whether from the same or different kernel launch) in shared memory.
The only way to save data from one kernel launch to the next is via global memory (or host memory).
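For illustration, here is a minimal sketch of that approach; the kernel names (stage_data, compute) and the doubling operation are my own placeholders, not from the question. Data written to a cudaMalloc'd buffer in one launch remains in global memory and is visible to every later launch:
// Hypothetical first kernel: copy the input into a persistent global buffer.
__global__ void stage_data(const double *src, double *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = src[i];
}

// Hypothetical second kernel: operate on the buffer; can be launched in a loop.
__global__ void compute(double *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0;   // placeholder calculation
}

// Host side (d_src and d_buf allocated with cudaMalloc, d_src filled via cudaMemcpy):
// stage_data<<<1, 224>>>(d_src, d_buf, 224);
// for (int iter = 0; iter < numIters; ++iter)
//     compute<<<1, 224>>>(d_buf, 224);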
Related
I thought that the shared memory of a CUDA device is private to a block. However, it seems to me that the pointer to shared memory is identical across two blocks:
#include <stdio.h>

__global__ void foo() {
    __shared__ int ar[8];
    printf("shared memory pointer %p at blockidx %i\n", ar, blockIdx.x);
}

int main() {
    dim3 blockDim(1);
    dim3 gridDim(2);
    foo<<<gridDim, blockDim>>>();
    cudaDeviceSynchronize();
}
Running the code above produces:
shared memory pointer 0x7f88f5000000 at blockidx 0
shared memory pointer 0x7f88f5000000 at blockidx 1
With this program, I expected to create two different blocks, initialize shared memory in each block, and obtain two different locations for the memory. Am I misunderstanding something? Or do these pointers indeed refer to different physical locations, even though the addresses look the same from within each block?
Shared memory is block-private, i.e. threads from one block cannot access another block's shared memory.
... for this very reason, it's actually to be expected that the address range for shared memory will be the same for all blocks - but in each block, loading from or storing to these addresses affects the block-local shared memory.
For intuition: This is somewhat similar to how, on the CPU, code in two processes may use identical pointer addresses but they will actually access different physical locations in memory (usually).
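A small sketch illustrating this (my own example, not from the question): every block sees the same shared-memory address, yet each block reads back only the value it wrote itself:
#include <stdio.h>

// Each block stores its own blockIdx.x at the "same" shared address,
// but reads back its own value: the address range is per-block.
__global__ void blockPrivate() {
    __shared__ int s;
    if (threadIdx.x == 0) s = blockIdx.x;
    __syncthreads();
    if (threadIdx.x == 0)
        printf("block %d sees s=%d at %p\n", blockIdx.x, s, (void*)&s);
}

int main() {
    blockPrivate<<<4, 32>>>();
    cudaDeviceSynchronize();
}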
I am trying to run the following OpenCL code. In the kernel function, I define an array int arr[1000] = {0};
kernel void test()
{
    int arr[1000] = {0};
}
Then I will create N threads to run the kernel.
cl::CommandQueue cmdQueue;
cmdQueue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(N), cl::NullRange); // kernel here is the one running test()
My question is: since OpenCL runs the threads in parallel, does that mean the peak memory usage will be N * 1000 * sizeof(int)?
This is not the way to OpenCL (yes, that's what I meant :).
The kernel function operates on kernel operands passed in from the host (CPU) - so you'd allocate your array on the host using clCreateBuffer and set the argument using clSetKernelArg. Your kernel does not declare/allocate the device memory, but simply receives it as a __global argument. Now when you run the kernel using clEnqueueNDRangeKernel, the OpenCL implementation will run a work-item on each of those 1000 ints.
If, on the other hand, you meant to allocate 1000 ints per work-item (device thread), your calculation is right (yes, they cost memory from each work-item's private pool), but it probably won't work: such per-work-item arrays live in __private memory (see here on how to check this for your device), which is severely limited.
I'm writing a program in which I need to:
- run a test on each pixel of an image
- if the test result is TRUE, add a point to a point cloud
- if the test result is FALSE, do nothing
I've already written working C++ code for this on the CPU side.
Now I need to speed it up using CUDA. My idea is to have blocks/threads (one thread per pixel, I guess) execute the test in parallel and, if the test result is TRUE, have the thread add a point to the cloud.
Here comes my trouble: how can I allocate space in device memory for a point cloud (using cudaMalloc or similar) if I don't know a priori the number of points that I will insert into the cloud?
Do I have to allocate a fixed amount of memory and then grow it every time the point cloud reaches that size limit? Or is there a method to "dynamically" allocate the memory?
When you allocate memory on the device, you may do so with two API calls: one is malloc as described by Taro, but it is limited by an internal driver limit (8 MB by default), which can be increased by calling cudaDeviceSetLimit with the parameter cudaLimitMallocHeapSize.
Alternatively, you may use cudaMalloc within a kernel, as it is both a host and device API method.
In both cases, Taro's observation stands: you will allocate a new, different buffer, just as you would on the CPU. Hence, growing a single buffer implies copying the data over. Note that cudaMemcpy is not a device API method, so you may need to write your own copy.
To my knowledge, there is no such thing as realloc in the CUDA API.
Back to your original issue: you might want to implement your algorithm in three phases. The first phase counts the number of samples you need, the second phase allocates the data array, and the third phase fills the data array. To implement this, you may use atomic functions to increment an int that counts the number of samples.
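A minimal sketch of that three-phase idea, assuming a grayscale image and a placeholder per-pixel test; the kernel and variable names are mine, not from the question:
// Placeholder test: a real implementation would apply the actual per-pixel criterion.
__device__ bool passesTest(unsigned char pixel) { return pixel > 128; }

// Phase 1: count how many pixels pass the test.
__global__ void countPoints(const unsigned char *img, int n, int *count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && passesTest(img[i]))
        atomicAdd(count, 1);
}

// Phase 3: write each accepted pixel index into the exactly-sized output array.
__global__ void fillPoints(const unsigned char *img, int n, int *points, int *cursor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && passesTest(img[i])) {
        int slot = atomicAdd(cursor, 1);   // unique slot per accepted pixel
        points[slot] = i;
    }
}

// Phase 2 happens on the host in between: copy *count back, cudaMalloc
// exactly that many ints for `points`, reset `cursor` to 0, launch fillPoints.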
I would have posted this as a comment, since it only partially answers the question, but it is too long for that.
Yes, you can dynamically allocate memory from the kernels.
You can call malloc() and free() within your kernels to dynamically allocate and free memory during computation, as explained in section B-16 of the CUDA 7.5 Programming Guide:
__global__ void mallocTest()
{
    size_t size = 123;
    char* ptr = (char*)malloc(size);
    memset(ptr, 0, size);
    printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);
    free(ptr);
}

int main()
{
    // Set a heap size of 128 megabytes. Note that this must
    // be done before any kernel is launched.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
    mallocTest<<<1, 5>>>();
    cudaDeviceSynchronize();
    return 0;
}
(You will need compute capability 2.x or higher.)
But by doing this you allocate a new, different buffer in memory; you don't make your previously host-allocated buffer "grow" like a CPU dynamic container (vector, list, etc.).
I think you should define a constant for the maximum size of your array, allocate that maximum size up front, and have your kernel increment the "really used size" within this maximum buffer.
If you do so, don't forget to make this increment atomic/synchronized so that each increment from each concurrent thread is counted.
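A sketch of that preallocation idea; MAX_POINTS, appendPoint, and the surrounding names are assumptions of mine rather than anything from the question:
// Upper bound chosen on the host; the kernel never writes past it.
#define MAX_POINTS (1 << 20)

// Appends a value into a preallocated buffer; *used tracks the
// "really used size" and is incremented atomically.
__device__ bool appendPoint(int *points, int *used, int value) {
    int slot = atomicAdd(used, 1);
    if (slot < MAX_POINTS) {
        points[slot] = value;
        return true;
    }
    return false;   // buffer full; the host can detect overflow from *used
}

// Host side: cudaMalloc MAX_POINTS ints for `points` and one int for `used`,
// cudaMemset `used` to 0, launch the kernel, then copy *used back to know
// how many entries are valid.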
Developers, could someone give me a hint, please?
I didn't find any information about how to allocate statically sized and dynamically sized shared memory in the same kernel, or, to ask more precisely:
How do I call a kernel where the amount of shared memory that needs to be allocated is only partly known at compile time?
Referring to examples of allocating shared memory, it is pretty obvious how to handle purely dynamic allocation.
But let's assume I have the following kernel:
__global__ void MyKernel(int Float4ArrSize, int FloatArrSize)
{
    __shared__ float Arr1[256];
    __shared__ char Arr2[256];
    extern __shared__ float DynamArr[];
    float4* DynamArr1 = (float4*) DynamArr;
    float*  DynamArr2 = (float*) &DynamArr1[Float4ArrSize];
    // do something
}
Kernel call:
int SharedMemorySize = Float4ArrSize + FloatArrSize;
MyKernel<<< numBlocks, threadsPerBlock, SharedMemorySize, stream >>>(Float4ArrSize, FloatArrSize);
I actually wasn't able to figure out how the compiler links the shared memory size given at launch only to the part I want to allocate dynamically.
Or does the parameter SharedMemorySize represent the total amount of shared memory per block, so that I would need to include the statically allocated arrays as well (int SharedMemorySize = Float4ArrSize + FloatArrSize + 256*sizeof(float) + 256*sizeof(char))?
Please enlighten me or just simply point to some code snippets.
Thanks a lot in advance.
cheers greg
Citing the programming guide: SharedMemorySize specifies the number of bytes in shared memory that is dynamically allocated per block for this call, in addition to the statically allocated memory; this dynamically allocated memory is used by any of the variables declared as an external array. SharedMemorySize is an optional argument which defaults to 0.
So if I understand what you want to do, it should probably look like this:
extern __shared__ float DynamArr[];
float*  DynamArr1 = DynamArr;
float4* DynamArr2 = (float4*) &DynamArr[DynamArr1_size];
Be aware, I didn't test it.
Here is a very useful post.
From the CUDA programming guide:
The [kernel's] execution configuration is specified by inserting an expression of the form <<< Dg, Db, Ns, S >>> between the function name and the parenthesized argument list, where:
Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory; this dynamically allocated memory is used by any of the variables declared as an external array as mentioned in __shared__; Ns is an optional argument which defaults to 0;
So basically, the shared memory size that you specify in the kernel call refers only to the dynamically allocated shared memory. You don't have to manually add the size of your statically allocated shared arrays.
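To make that concrete, here is a small sketch of my own (not tested against the asker's full code) combining static and dynamic shared memory; the launch argument counts only the bytes of the extern __shared__ array, and it is expressed in bytes rather than element counts:
__global__ void MixedSharedKernel(int nFloat4, int nFloat)
{
    __shared__ float staticArr1[256];          // sized at compile time
    __shared__ char  staticArr2[256];          // sized at compile time

    extern __shared__ float4 dynBase[];        // sized by the launch's Ns argument
    // Put the float4 part first so its 16-byte alignment is preserved.
    float4 *dynFloat4 = dynBase;
    float  *dynFloat  = reinterpret_cast<float*>(dynFloat4 + nFloat4);

    // ... use staticArr1, staticArr2, dynFloat4, dynFloat ...
}

// Host side: Ns covers only the dynamic part, in bytes.
// size_t dynBytes = nFloat4 * sizeof(float4) + nFloat * sizeof(float);
// MixedSharedKernel<<<numBlocks, threadsPerBlock, dynBytes, stream>>>(nFloat4, nFloat);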
What's the best (most efficient) way to zero a device vector previously allocated with cudaMalloc?
Launch one thread to do it on the GPU?
Link to cudaMemset()
cudaError_t cudaMemset ( void* devPtr, int value, size_t count )
Initializes or sets device memory to a value. Fills the first count bytes of the memory area pointed to by devPtr with the constant byte value value.
Note that this function is asynchronous with respect to the host unless devPtr refers to pinned host memory.
Note: this function may also return error codes from previous, asynchronous launches.
See also memset synchronization details.
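For example, a minimal usage sketch (the variable names here are illustrative):
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;
    float *d_vec = nullptr;
    cudaMalloc(&d_vec, N * sizeof(float));

    // Zero the whole vector on the device; the count is in bytes.
    // Setting every byte to 0 yields 0.0f for floats (and 0 for integer types).
    cudaMemset(d_vec, 0, N * sizeof(float));

    cudaFree(d_vec);
    return 0;
}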