What's the most efficient way to zero a device vector previously allocated with cudaMalloc?
Launch one thread to do it on the GPU?
See cudaMemset():
cudaError_t cudaMemset ( void* devPtr, int value, size_t count )
Initializes or sets device memory to a value. Fills the first count bytes of the memory area pointed to by devPtr with the constant byte value value.
Note that this function is asynchronous with respect to the host unless devPtr refers to pinned host memory.
Note: this function may also return error codes from previous, asynchronous launches.
See also memset synchronization details.
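For example, zeroing a device vector of N floats previously allocated with cudaMalloc takes a single call and no kernel launch (a minimal sketch; d_vec and N are illustrative names):

#include <cuda_runtime.h>

int main()
{
    const size_t N = 1024;          // illustrative size
    float *d_vec = NULL;

    cudaMalloc((void **)&d_vec, N * sizeof(float));

    // Fill all N * sizeof(float) bytes with the byte value 0.
    // This zeroes the vector because every byte of the value 0.0f is 0.
    cudaMemset(d_vec, 0, N * sizeof(float));

    cudaFree(d_vec);
    return 0;
}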
In the function:
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset)
Does the argument length in mmap represent a number of bytes or a number of pages? Also, can I use mmap similarly to malloc? What are the differences?
The length parameter is in bytes. The Linux man page does not say this explicitly, but the POSIX spec says (emphasis mine):
The mmap() function shall establish a mapping between the address space of the process at an address pa for len bytes to the memory object represented by the file descriptor fildes at offset off for len bytes.
It is possible to use mmap as a way to allocate memory (you'll want to use MAP_ANONYMOUS or else map the /dev/zero device), but it's not in general a good direct replacement for malloc:
Mappings will always be made in page units (so the system will round up length to the next multiple of the page size), so it's very inefficient for small allocations.
You can't pass pointers returned by mmap to realloc or free (use mremap and munmap instead).
munmap actually returns memory to the system, whereas free may potentially keep it assigned to your process and just mark it available for use by future calls to malloc. This has pros and cons. On the one hand, if you know you will not be needing that memory in the future, it is nice to let the system have it back. On the other hand, every mmap/munmap requires a system call, which is relatively slow, whereas malloc may be able to allocate previously freed memory that already belongs to your process without a system call.
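A minimal sketch of allocating memory this way on Linux (MAP_ANONYMOUS, with the length given in bytes and rounded up to page units by the kernel):

#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t length = 4096;   /* in bytes; the kernel rounds mappings up to whole pages */

    /* MAP_ANONYMOUS gives zero-filled memory not backed by any file, so fd is -1. */
    void *p = mmap(NULL, length, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* ... use the memory ... */

    /* Return the memory to the system; free()/realloc() must not be used on p. */
    munmap(p, length);
    return 0;
}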
I'm implementing a naive memory manager for Vulkan device memory, and would like to make sure that I understand the alignment requirements for memory and how to satisfy them.
So, assuming that I've allocated a 'pool' of memory using vkAllocateMemory and wish to sub-allocate blocks of memory in this pool to individual resources (based on a VkMemoryRequirements struct), will the following pseudocode be able to allocate a section of this memory with the correct size and alignment requirements?
Request memory with RequiredSize and RequiredAlignment
Iterate over blocks in the pool looking for one that is free and has size > RequiredSize
If the offset in memory of the current block is NOT divisible by RequiredAlignment, figure out the difference between the alignment and the remainder
If the size of the current block minus the difference is less than RequiredSize, skip to the next block in the pool
If the difference is more than 0, insert a padding block with size equal to the difference, and adjust the current unallocated block size and offset
Allocate RequiredSize bytes from the start of the current unallocated block (now aligned), adjust the Size and Offset of the unallocated block accordingly
Return vkDeviceMemory handle (of pool), size and offset (of new allocated block)
If we reach the end of the block list instead, this pool cannot allocate the memory
In other words, do we just need to make sure that Offset is a multiple of RequiredAlignment?
In other words, do we just need to make sure that Offset is a multiple of RequiredAlignment?
For alignment, that is nearly sufficient. In vkBindBufferMemory, one of the valid usage requirements is:
memoryOffset must be an integer multiple of the alignment member of the VkMemoryRequirements structure returned from a call to vkGetBufferMemoryRequirements with buffer
and there is a parallel statement in the valid usage requirements of vkBindImageMemory:
memoryOffset must be an integer multiple of the alignment member of the VkMemoryRequirements structure returned from a call to vkGetImageMemoryRequirements with image
If the previous block contains a non-linear resource while the current one is linear, or vice versa, then the alignment requirement is the maximum of VkMemoryRequirements.alignment and the device's bufferImageGranularity. This also needs to be checked at the end of the memory block.
However, you also need to take into account that the memory type of the pool must be one of the types set in the memoryTypeBits field of VkMemoryRequirements.
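A minimal sketch of the per-block checks described above (AlignUp and BlockFits are illustrative names, not part of any particular allocator):

#include <vulkan/vulkan.h>

// Round an offset up to the next multiple of alignment (Vulkan alignments are powers of two).
VkDeviceSize AlignUp(VkDeviceSize offset, VkDeviceSize alignment)
{
    return (offset + alignment - 1) & ~(alignment - 1);
}

// A free block in the pool can satisfy the request if:
//  - the pool's memory type index is allowed by req.memoryTypeBits, and
//  - after aligning the block's offset, the remaining space still fits req.size.
bool BlockFits(VkDeviceSize blockOffset, VkDeviceSize blockSize,
               const VkMemoryRequirements& req, uint32_t poolMemoryTypeIndex)
{
    if (((req.memoryTypeBits >> poolMemoryTypeIndex) & 1u) == 0)
        return false;

    VkDeviceSize alignedOffset = AlignUp(blockOffset, req.alignment);
    VkDeviceSize padding = alignedOffset - blockOffset;
    return blockSize >= padding + req.size;
}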
I'm writing a program in which I need to:
make a test on each pixel of an image
if test result is TRUE I have to add a point to a point cloud
if test result is FALSE, make nothing
I've already written working CPU-side C++ code.
Now I need to speed it up using CUDA. My idea is to have some blocks/threads (one thread per pixel, I guess) execute the test in parallel and, if the test result is TRUE, have the thread add a point to the cloud.
Here comes my trouble: how can I allocate space in device memory for a point cloud (using cudaMalloc or similar) if I don't know a priori the number of points that I will insert in the cloud?
Do I have to allocate a fixed amount of memory and then increase it every time the point cloud reaches its size limit? Or is there a method to "dynamically" allocate the memory?
When you allocate memory on the device, you may do so with two API calls: one is malloc, as described by Taro, but it is limited by an internal driver limit (8 MB by default), which can be increased by setting the appropriate limit with cudaDeviceSetLimit and the parameter cudaLimitMallocHeapSize.
Alternatively, you may use cudaMalloc within a kernel, as it is both a host and device API method.
In both cases, Taro's observation stands: you will allocate a new, different buffer, just as you would on the CPU. Hence, using a single buffer might require copying data around. Note that cudaMemcpy is not a device API method, so you may need to write your own.
To my knowledge, there is no such thing as realloc in the CUDA API.
Back to your original issue, you might want to implement your algorithm in three phases: the first phase counts the number of samples you need, the second allocates the data array, and the third fills it. To implement this, you may use atomic functions to increment an int that counts the number of samples.
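A minimal sketch of the counting phase (the kernel name and the per-pixel test are illustrative placeholders, not code from the question):

// Placeholder per-pixel test; substitute your real criterion.
__device__ bool passesTest(unsigned char pixel) { return pixel > 128; }

__global__ void countPoints(const unsigned char* image, int numPixels, unsigned int* count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels && passesTest(image[i]))
        atomicAdd(count, 1u);   // phase 1: count how many points will be produced
}

The host then reads the counter back, allocates a point buffer of exactly that size with cudaMalloc, and a second kernel uses the same atomicAdd pattern to obtain a unique output index for each point it writes.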
I would like to post this as a comment, as it only partially answers, but it is too long for this.
Yes, you can dynamically allocate memory from the kernels.
You can call malloc() and free() within your kernels to dynamically allocate and free memory during computation, as explained in section B-16 of the CUDA 7.5 Programming Guide:
#include <cstdio>
#include <cstdlib>
#include <cstring>

__global__ void mallocTest()
{
    size_t size = 123;
    char* ptr = (char*)malloc(size);
    memset(ptr, 0, size);
    printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);
    free(ptr);
}

int main()
{
    // Set a heap size of 128 megabytes. Note that this must
    // be done before any kernel is launched.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
    mallocTest<<<1, 5>>>();
    cudaDeviceSynchronize();
    return 0;
}
(You will need compute capability 2.x or higher.)
But by doing this you allocate a new and different buffer in memory; you don't make the buffer you previously allocated from the host "grow" like a CPU dynamic container (vector, list, etc.).
I think you should define a constant for the maximum size of your array, allocate that maximum size up front, and have your kernel increment the "really used size" counter within this maximum buffer.
If you do so, don't forget to make this increment atomic/synchronized, so that each increment from each concurrent thread is counted.
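A minimal sketch of that idea (the Point type, the test, and the kernel name are illustrative; maxSize is the constant maximum, usedSize the atomically incremented counter):

struct Point { float x, y, z; };

__global__ void buildCloud(const unsigned char* image, int numPixels,
                           Point* cloud, unsigned int* usedSize, unsigned int maxSize)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels) return;

    if (image[i] > 128) {                            // placeholder test
        unsigned int idx = atomicAdd(usedSize, 1u);  // reserve a unique slot
        if (idx < maxSize) {
            Point p;                                 // placeholder conversion from pixel to point
            p.x = (float)i; p.y = 0.0f; p.z = 0.0f;
            cloud[idx] = p;
        }
    }
}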
Recently I was asked a question to implement a very simple malloc with the following restrictions and initial conditions.
#define HEAP_SIZE 2048

int main()
{
    privateHeap = malloc(HEAP_SIZE + 256); // extra 256 bytes for heap metadata
    void* ptr = mymalloc( size_t(750) );
    myfree( ptr );
    return 0;
}
I need to implement mymalloc and myfree here using exactly the space provided. 256 bytes map nicely onto 2048 bits, so I can keep a bit array recording whether each byte is allocated or free. But when myfree is called with ptr, I cannot tell how much memory was allocated to begin with, and I cannot use any extra bits.
I can't see a way around this, but I've been told repeatedly that it can be done. Any suggestions?
EDIT 1:
Alignment restrictions don't exist. I assumed I am not going to align anything.
There was a demo program that did a series of mallocs and frees to test this, and it didn't have any memory blocks that were small. But that doesn't guarantee anything.
EDIT 2:
The guidelines from the documentation:
Certain Guidelines on your code:
Manage the heap metadata in the private heap; do not create extra linked lists outside of the provided private heap;
Design mymalloc, myrealloc, myFree to work for all possible inputs.
myrealloc should behave like realloc in the C++ library:
void* myrealloc( void* reallocThis, size_t newSize ):
If newSize is bigger than the size of chunk in reallocThis:
It should first try to allocate a chunk of size newSize in place, so that the new chunk's base pointer is also reallocThis;
If there is no free space available to do in place allocation, it should allocate a chunk of requested size in a different region;
and then it should copy the contents from the previous chunk.
If the function failed to allocate the requested block of memory, a NULL pointer is returned, and the memory block pointed to by argument reallocThis is left unchanged.
If newSize is smaller, realloc should shrink the size of the chunk and should always succeed.
If newSize is 0, it should work like free.
If reallocThis is NULL, it should work like malloc.
If reallocThis is pointer that was already freed, then it should fail gracefully by returning NULL
myFree should not crash when it is passed a pointer that has already been freed.
A common way malloc implementations keep track of the size of memory allocations, so that free knows how big they are, is to store the size in the bytes just before the pointer returned by malloc. Say you only need two bytes to store the length: when the caller of malloc requests n bytes of memory, you actually allocate n + 2 bytes, store the length in the first two bytes, and return a pointer to the byte just past where you stored the size.
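A minimal sketch of that header trick, assuming 2-byte sizes and ignoring alignment (as the question does); the helper names are illustrative:

#include <stdint.h>
#include <string.h>

/* Given a free region 'block' inside the private heap, record the request size
   in its first two bytes and hand the caller the byte just past them. */
static void* place_block(uint8_t* block, uint16_t n)
{
    memcpy(block, &n, 2);
    return block + 2;
}

/* On free, step back two bytes to recover how large the allocation was. */
static uint16_t block_size(void* user_ptr)
{
    uint16_t n;
    memcpy(&n, (uint8_t*)user_ptr - 2, 2);
    return n;
}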
As for your algorithm generally, a simple and naive implementation is to keep track of unallocated memory with a linked list of free memory blocks that are kept in order of their location in memory. To allocate space you search for a free block that's big enough. You then modify the free list to exclude that allocation. To free a block you add it back to the free list, coalescing adjacent free blocks.
This isn't a good malloc implementation by modern standards, but a lot of old memory allocators worked this way.
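A minimal sketch of that free-list bookkeeping (first-fit search only; block splitting and coalescing are omitted, and the names are illustrative):

#include <stddef.h>

/* Free-list node stored at the start of each free region, kept sorted by address. */
struct free_block {
    size_t size;              /* size of this free region in bytes             */
    struct free_block* next;  /* next free region at a higher address, or NULL */
};

/* First fit: return the first free block large enough to hold 'need' bytes. */
static struct free_block* find_fit(struct free_block* head, size_t need)
{
    for (struct free_block* b = head; b != NULL; b = b->next)
        if (b->size >= need)
            return b;
    return NULL;
}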
You seem to be thinking of the 256 bytes of meta-data as a bit-map to track free/in-use on a byte-by-byte basis.
I'd consider the following as only one possible alternative:
I'd start by treating the 2048-byte heap as 1024 "chunks" of 2 bytes each. This gives you 2 bits of information for each chunk. You can treat the first of those as signifying whether that chunk is in use, and the second as signifying whether the following chunk is part of the same logical block as the current one.
When your free function is called, you use the passed address to find the correct starting point in your bitmap. You then walk through the bits, marking each chunk as free, until you reach one where the second bit is 0, indicating the end of the current logical block (i.e., that the next 2-byte chunk is not part of the current logical block).
[Oops: just noticed that Ross Ridge already suggested nearly the same basic idea in a comment.]
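A minimal sketch of walking that 2-bit map when freeing a block (the bit layout and helper names are illustrative):

#include <stdint.h>

/* Two bits per 2-byte chunk: bit 0 = "in use", bit 1 = "next chunk is part of the same block". */
static uint8_t chunk_bits(const uint8_t* map, int chunk)
{
    return (map[chunk / 4] >> ((chunk % 4) * 2)) & 0x3;
}

static void clear_chunk(uint8_t* map, int chunk)
{
    map[chunk / 4] &= ~(0x3 << ((chunk % 4) * 2));
}

/* Free the logical block whose first chunk is 'first': clear chunks until one
   whose continuation bit is 0, which marks the last chunk of the block. */
static void free_logical_block(uint8_t* map, int first)
{
    int chunk = first;
    for (;;) {
        uint8_t bits = chunk_bits(map, chunk);
        clear_chunk(map, chunk);
        if ((bits & 0x2) == 0)   /* continuation bit clear: end of the block */
            break;
        chunk++;
    }
}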
I'm passing 3 arrays, with size N=224, to a kernel. The kernel itself calls another function foo(threadIdx.x), and foo calls another function bar(i), where i goes from 1 to 224. The second function needs to access the arrays passed to the kernel, but the code I have now tells me that the argument i is undefined.
I tried to save a copy of the arrays into shared memory, but it didn't work:
__global__ void dummy(double *pos_x_d, double *pos_y_d, double *hist_d){
    int i = threadIdx.x;
    hist_d[i] = pos_x_d[i] + pos_y_d[i];
    __syncthreads();
    foo(i);
    __syncthreads();
}
The host code looks like:
cudaMalloc((void **) &pos_x_d,(N*sizeof(double)));
cudaMalloc((void **) &pos_y_d,(N*sizeof(double)));
cudaMalloc((void **) &hist_d,(N*sizeof(double)));
//Copy data to GPU
cudaMemcpy((void *)pos_x_d, (void*)pos_x_h,N*sizeof(double),cudaMemcpyHostToDevice);
cudaMemcpy((void *)pos_y_d, (void*)pos_y_h,N*sizeof(double),cudaMemcpyHostToDevice);
//Launch Kernel
dummy<<<1,224>>>(pos_x_d,pos_y_d,hist_d);
Is it possible to launch two kernels: a first one to send data to shared memory, and then a second one to do the calculations? I also need to loop over the second kernel, which is why I wanted to put the data in shared memory in the first place. The error comes from lines 89 and 90, which means it has to do with the shared memory. Complete code is here.
Is it possible to launch two kernels: a first one to send data to shared memory, and then a second one to do the calculations?
No, it's not possible. The lifetime of shared memory is that of the threadblock associated with it. A threadblock cannot reliably use values stored in shared memory by a different threadblock (whether from the same or a different kernel launch).
The only way to save data from one kernel launch to the next is via global memory (or host memory).
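A minimal sketch of passing intermediate results between two kernel launches through global memory, using the question's buffer names on the host side (the kernel names and the second stage's computation are illustrative):

__global__ void stage1(const double* pos_x, const double* pos_y, double* hist, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        hist[i] = pos_x[i] + pos_y[i];   // results persist in global memory after the launch
}

__global__ void stage2(double* hist, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        hist[i] *= 2.0;                  // a later launch can read what stage1 wrote
}

// Host side: hist_d remains valid between launches, so no copy is needed in between.
//   stage1<<<1, 224>>>(pos_x_d, pos_y_d, hist_d, 224);
//   stage2<<<1, 224>>>(hist_d, 224);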