Implementing realloc in CUDA without moving data - c++

According to this question and the NVIDIA CUDA Programming Guide, the realloc function is not implemented:
The CUDA in-kernel malloc() function allocates at least size bytes
from the device heap and returns a pointer to the allocated memory or
NULL if insufficient memory exists to fulfill the request. The
returned pointer is guaranteed to be aligned to a 16-byte boundary.
The CUDA in-kernel free() function deallocates the memory pointed to
by ptr, which must have been returned by a previous call to malloc().
If ptr is NULL, the call to free() is ignored. Repeated calls to
free() with the same ptr has undefined behavior.
I am currently stuck on a portion of the GMP library (or, more strictly, my attempt to port it to CUDA) which relies on this functionality:
__host__ __device__ static void * // generate this function for both CPU and GPU
gmp_default_realloc (void *old, size_t old_size, size_t new_size)
{
  mp_ptr p;
#if __CUDA_ARCH__ // this directive separates device and host code
  /* ? */
#else
  p = (mp_ptr) realloc (old, new_size); /* host code has realloc from glibc */
#endif
  if (!p)
    gmp_die ("gmp_default_realloc: Virtual memory exhausted.");
  return p;
}
Essentially I could simply call malloc with new_size, then memcpy (or maybe memmove), then free the previous block, but this forces the data (large arrays) to be moved, which I would like to avoid.
Is there any efficient way to implement a (standard C or C++) realloc function inside a kernel? Let's say I have a large array of dynamically allocated data (already allocated by malloc), and in some other place realloc is invoked to request a larger amount of memory for that block. In short, I would like to avoid copying the whole data array to a new location, and I ask specifically how to do that (if it is possible at all).
I am not especially familiar with the PTX ISA or the underlying implementation of the in-kernel heap functions, but maybe it's worth a look in that direction?

Most malloc implementations over-allocate; this is the reason why realloc can sometimes avoid copying bytes - the old block may be large enough for the new size. But apparently in your environment the system malloc doesn't do that, so I think your only option is to reimplement all 3 primitives, gmp_default_{alloc,realloc,free}, on top of the system-provided malloc/free.
There are many open-source malloc implementations out there; glibc has one you might be able to adapt.
I'm not familiar with CUDA or GMP, but off the top of my head:
gmp_malloc() followed by a plain free() probably works on "normal" platforms, but will likely cause heap corruption if you go ahead with this
if all you want is a more efficient realloc, you can simply over-allocate in your custom malloc (up to some size, say the nearest power of 2), just so you can avoid copying in the subsequent realloc. You don't even need a full-blown heap implementation for that.
your implementation may need to use a mutex or some such to protect your heap against concurrent modifications
you can improve performance even more if you never (or infrequently) return the malloc()ed blocks back to the OS from within your custom heap, i.e. keep the gmp_free()ed blocks around for subsequent reuse instead of calling the system free() on them immediately
come to think of it, a better idea would be to introduce a sane malloc implementation into that platform, outside of your GMP lib, so that other programs and libraries could draw their memory from the same pool, instead of GMP doing one thing and everything else doing something else. This should help with the overall memory consumption w.r.t. the previous point. Maybe you should port glibc first :)
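The over-allocation idea from the second bullet can be sketched on the host side in plain C. This is only an illustration: the gmp_*_sketch names, the header layout, and the power-of-two rounding are my assumptions, not GMP's; inside a kernel you would substitute the in-kernel malloc/free for the host ones.

```c
#include <stdlib.h>
#include <string.h>

/* Keep a capacity header in front of each block so realloc can grow
 * in place whenever the rounded-up capacity still fits. */
typedef struct { size_t capacity; } block_hdr;

static size_t round_up_pow2(size_t n)
{
    size_t c = 16;                 /* match CUDA's 16-byte alignment granularity */
    while (c < n) c <<= 1;
    return c;
}

void *gmp_malloc_sketch(size_t size)
{
    size_t cap = round_up_pow2(size);
    block_hdr *h = (block_hdr *)malloc(sizeof(block_hdr) + cap);
    if (!h) return NULL;
    h->capacity = cap;
    return h + 1;                  /* user pointer points just past the header */
}

void *gmp_realloc_sketch(void *old, size_t old_size, size_t new_size)
{
    block_hdr *h = (block_hdr *)old - 1;
    if (new_size <= h->capacity)   /* fits in the slack: no copy needed */
        return old;
    void *p = gmp_malloc_sketch(new_size);
    if (!p) return NULL;
    memcpy(p, old, old_size);      /* copy only when we truly outgrow the block */
    free(h);
    return p;
}

void gmp_free_sketch(void *p)
{
    if (p) free((block_hdr *)p - 1);
}
```

Growth requests that stay within the over-allocated capacity return the same pointer, which is exactly the copy-avoidance being asked about; only requests beyond the capacity pay for a move.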

Related

New vs. Malloc, when overloading New

I'm overloading new and delete to implement my own small-objects/thread-safe allocator.
The problem is that when I overload new, I cannot use new without breaking universal causality, or at least the compiler. Most examples I found where new is overloaded use malloc() to do the actual allocation. But from what I understood of C++, there is no use case for malloc() at all.
Multiple answers similar to this one exist, some in more detail outside of SO: In what cases do I use malloc vs new?
My question, is how do I allocate the actual memory when overloading operator new without using Malloc() ?
(This is out of curiosity more than anything; try not to take the reasoning behind the overload too seriously - I have a separate question out on that anyhow!)
Short answer: if you don't want existing malloc, you need to implement your own heap manager.
A heap manager - for example malloc in glibc on Linux, or HeapAlloc on Windows - is a user-level algorithm. First, keep in mind that a heap is optimized for allocating small objects, say 4~512 bytes.
How do you implement your own heap manager? At the very least, you must call a system API that allocates a chunk of memory in your process. There is VirtualAlloc on Windows and sbrk on Linux. These APIs allocate a large chunk of memory, but the size must be a multiple of the page size. The page size on x86 under Windows/Linux is typically 4 KB.
After obtaining a chunk of page, you need to implement your own algorithms how to chop down this big memory into smaller requests. A classic (still very practical) implementation and algorithm is dlmalloc: http://g.oswego.edu/dl/html/malloc.html
To implement it, you need several data structures for book-keeping and a number of policies for optimization. For example, for small objects of, say, 16, 20, 36, or 256 bytes, a heap manager maintains a list of free blocks of each size - so there is a list of lists. If the requested size is bigger than a page, it just calls VirtualAlloc or sbrk directly. However, an efficient implementation is very challenging: you must consider not only speed and space overhead, but also cache locality and fragmentation.
If you are interested in heap managers optimized for multithreaded environments, take a look at tcmalloc: http://goog-perftools.sourceforge.net/doc/tcmalloc.html
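The "list of lists" of size classes can be sketched in a few lines of C. This is a toy, not dlmalloc: pool_alloc/pool_free are made-up names, the classes are assumed to be powers of two, and a real heap manager would carve whole pages into blocks rather than call malloc per block.

```c
#include <stdlib.h>

#define NCLASSES 8                     /* size classes: 16, 32, ..., 2048 bytes */

/* One singly linked list of recycled blocks per size class. */
typedef struct free_node { struct free_node *next; } free_node;
static free_node *free_lists[NCLASSES];

static int class_of(size_t size)
{
    size_t c = 16;
    for (int i = 0; i < NCLASSES; i++, c <<= 1)
        if (size <= c) return i;
    return -1;                         /* too big: not handled by a class */
}

void *pool_alloc(size_t size)
{
    int i = class_of(size);
    if (i < 0) return malloc(size);    /* large request: pass straight through */
    if (free_lists[i]) {               /* reuse a recycled block of this class */
        free_node *n = free_lists[i];
        free_lists[i] = n->next;
        return n;
    }
    return malloc((size_t)16 << i);    /* always allocate the full class size */
}

void pool_free(void *p, size_t size)
{
    int i = class_of(size);
    if (i < 0) { free(p); return; }
    free_node *n = (free_node *)p;     /* recycle instead of calling free() */
    n->next = free_lists[i];
    free_lists[i] = n;
}
```

Note that freed blocks are kept for reuse rather than returned to the system, which is the same trick suggested for the GMP port above.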
I see no problem in calling malloc() inside a new overload, just make sure you overload delete so it calls free(). But if you really don't want to call malloc(), one way is to just allocate enough memory another way:
class A {
public:
    /* ... */
    static void* operator new (size_t size) {
        return (void *) new unsigned char[size];
    }
    static void operator delete (void *p) {
        delete[] (unsigned char *) p;
    }
    /* ... */
};

realloc() for NUMA Systems using HWLOC

I have several custom allocators that provide different means of allocating memory based on different policies. One of them allocates memory on a defined NUMA node. The interface to the allocator is straightforward:
template<typename config>
class NumaNodeStrategy
{
public:
    static void *allocate(const size_t sz){}
    static void *reallocate(void *old, size_t sz, size_t old_sz){}
    static void deallocate(void *p, size_t sz){}
};
The allocation itself is handled using the hwloc_alloc_membind_nodeset() methods, with the appropriate parameters set for allocation policies etc. However, hwloc only provides methods for allocating and freeing memory, and I was wondering how I should implement reallocate().
Two possible solutions:
Allocate new memory area and memcpy() the data
Use hwloc_set_membind_nodeset() to set the memory allocation / binding policy for the nodeset and use plain malloc() / posix_memalign() and realloc().
Can anyone help me in getting this right?
Update:
I try to make the question more specific: Is there a possibility to perform a realloc() using hwloc without allocating new memory and moving the pages around?
To reply to the edit:
There's no realloc in hwloc, and we currently have no plan to add one. If you can say precisely what you want (a C prototype of the function), feel free to add a ticket to https://svn.open-mpi.org/trac/hwloc
To reply to ogsx: The memory binding isn't allocation-specific; it's virtual-memory-area specific, and possibly thread-specific. If you realloc, the libc doesn't do anything special.
1) If it can realloc within the same page, you get memory on the same node. Good, but rare, especially for large buffers.
2) If it reallocs into a different page (most of the cases for large buffers), it depends on whether the corresponding page has already been allocated in physical memory by the malloc lib in the past (malloc'ed and freed in virtual memory, but still allocated in physical memory).
2.a) If the virtual page has already been allocated, it may have been placed on another node for various reasons in the past; you're screwed.
2.b) If the new virtual page has not been allocated yet, the default is to allocate on the current node. If you specified a binding with set_area_membind() or mbind() earlier, it'll be allocated on the right node. You may be happy in this case.
In short, it depends on a lot of things. If you don't want to bother with the malloc lib doing complex/hidden internal things, and especially if your buffers are large, doing mmap(MAP_ANONYMOUS) instead of malloc is a simple way to be sure that pages are allocated when you really want them. And you even have mremap to do something similar to realloc.
alloc becomes mmap(length) + set_area_membind
realloc becomes mremap + set_area_membind (on the entire mremap'ed buffer)
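The two-line recipe above might look roughly like this on Linux. This is a sketch only: the hwloc rebinding calls are left as comments because they need a real topology and nodeset, mremap is a GNU/Linux extension, and the numa_*_sketch names are mine.

```c
#define _GNU_SOURCE            /* for mremap() */
#include <stddef.h>
#include <sys/mman.h>

/* alloc: anonymous mmap, then (in real code) bind the area to the nodeset. */
void *numa_alloc_sketch(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    /* here: hwloc_set_area_membind(topology, p, len, nodeset, policy, flags); */
    return p;
}

/* realloc: mremap, then rebind the ENTIRE buffer, since mremap may have
 * moved it or appended fresh (unbound) pages. */
void *numa_realloc_sketch(void *old, size_t old_len, size_t new_len)
{
    void *p = mremap(old, old_len, new_len, MREMAP_MAYMOVE);
    if (p == MAP_FAILED) return NULL;
    /* here: hwloc_set_area_membind(topology, p, new_len, nodeset, policy, flags); */
    return p;
}
```

Because mmap only reserves virtual pages, physical pages are not allocated until first touch, which is exactly why this route avoids the "hidden internal things" a malloc lib might have done to the pages earlier.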
Never used that but looks interesting.
The hwloc_set_area_membind_nodeset does the trick, doesn't it?
HWLOC_DECLSPEC int
hwloc_set_area_membind_nodeset (hwloc_topology_t topology,
const void *addr, size_t len, hwloc_const_nodeset_t nodeset,
hwloc_membind_policy_t policy, int flags)
Bind the already-allocated memory identified by (addr, len) to the NUMA node(s) in nodeset.
Returns:
-1 with errno set to ENOSYS if the action is not supported
-1 with errno set to EXDEV if the binding cannot be enforced
On Linux, this call is implemented via mbind. It works only if the pages in the area have not been touched yet, so it is just a more correct way to perform your second solution. UPDATE: there are MPOL_MF_MOVE* flags to move touched data.
The only syscall I know that moves pages without reallocate-and-copy is move_pages:
move_pages moves a set of pages in the address space of an executing process to a different NUMA node.
You're wrong. mbind can move pages that have been touched. You just need to add MPOL_MF_MOVE. That's what hwloc_set_area_membind_nodeset() does if you add the flag HWLOC_MEMBIND_MIGRATE.
move_pages is just a different way to do it (more flexible, but a bit slower, because you can move independent pages to different places). Both mbind with MPOL_MF_MOVE and move_pages (and migrate_pages) end up using the exact same migrate_pages() function in mm/migrate.c once they have converted the input into a list of pages.

After playing wav file, do I need to delete the buffer?

I am trying to implement classes implementing the wav playing, as explained in this example. The relevant code part is here :
/* Setup for conversion */
wav_cvt.buf = malloc(wav_len * wav_cvt.len_mult);
wav_cvt.len = wav_len;
memcpy(wav_cvt.buf, wav_buf, wav_len);
/* We can delete the original WAV data now */
SDL_FreeWAV(wav_buf);
/* And now we're ready to convert */
SDL_ConvertAudio(&wav_cvt);
When a wav file finishes playing (I am not going to play it again), do I need to free the memory buffer that is malloc()-ed above? Or is it done automatically somewhere?
No, nothing is done automatically. You must free it.
Remember that C (and any implementation of it) doesn't manage dynamic memory automatically: whenever you have allocated some piece of memory (marking that memory as USED), you should free() it when you are done, to mark it as UNUSED again. That's not a hard requirement, though - see below.
Any malloc is generally free'd elsewhere by the same module. I say generally because you may deliberately never give the memory back, for performance or persistence reasons. Furthermore, all memory allocations are reclaimed by the operating system when the process terminates regardless, so you're not endangering the system.
Since you malloc()ed the buf yourself, you should free() it yourself. Save SDL_FreeWAV for wave buffers passed to you by SDL that you are done with (such as from SDL_LoadWAV).
Internally, SDL_LoadWAV makes a malloc call, and SDL_FreeWAV is a wrapper around the corresponding free. This allocate/deallocate function pairing is common, as some libraries implement custom memory-management routines that resemble or wrap malloc and free. They may even open up new heap contexts that are not accessible from the standard functions and are intended to be private. There's not even a requirement that the memory be allocated on a heap, but this is orthogonal to your question.
It's likely that SDL_FreeWAV is just a straight free, but when a library provides deallocation functions you should prefer them in case the behaviour differs.
When in doubt, always call the deallocation routine if you believe you're done with the memory resource. Double free errors are noisy and generally generate stack traces that will let you quickly identify the problem. Other libraries such as glib will usually have built-in diagnostics that will alert you to overzealous deallocation. Deallocating aggressively also aids in locating logical errors: If you think you're done with the memory, but some other part of the program isn't, the resource use will need to be re-examined.

C++ Using the new operator efficiently

When instantiating a class with new, instead of deleting the memory, what kinds of benefits would we gain from reusing the objects?
What is the process behind new? Does a context switch occur? New memory is allocated - who does the allocation? The OS?
You've asked a few questions here...
Instead of deleting the memory what kinds of benefits would we gain based on the reuse of the objects?
That depends entirely on your application. Even supposing I knew what the application is, you've left another detail unspecified -- what is the strategy behind your re-use? But even knowing that, it's very hard to predict or answer generically. Try some things and measure them.
As a rule of thumb I like to minimize the most gratuitous of allocations. This is mostly premature optimization, though. It'd only make a difference over thousands of calls.
What is the process of new?
Entirely implementation dependent. But the general strategy that allocators use is to have a free list, that is, a list of blocks which have been freed in the process. When the free list is empty or contains insufficient contiguous free space, it must ask the kernel for the memory, which it can only give out in blocks of a constant page size. (4096 on x86.) An allocator also has to decide when to chop up, pad, or coalesce blocks. Multi-threading can also put pressure on allocators because they must synchronize their free lists.
Generally it's a pretty expensive operation. Maybe not so much relative to what else you're doing. But it ain't cheap.
Does a context switch occur?
Entirely possible. It's also possible that it won't. Your OS is free to do a context switch any time it gets an interrupt or a syscall, so uh... that can happen at a lot of times; I don't see any special relationship between this and your allocator.
New memory is allocated, who is doing the allocation? OS?
It might come from a free list, in which case there is no system call involved, hence no help from the OS. But it might come from the OS if the free list can't satisfy the request. Also, even if it comes from the free list, your kernel might have paged out that data, so you could get a page fault on access and the kernel's allocator would kick in. So I guess it'd be a mixed bag. Of course, you can have a conforming implementation that does all kinds of crazy things.
new allocates memory for the class on the heap, and calls the constructor.
context switches do not have to occur.
The C++ runtime allocates the memory on its free store using whatever mechanism it deems fit.
Usually the C++ runtime allocates large blocks of memory using OS memory-management functions, and then subdivides those using its own heap implementation. The Microsoft C++ runtime mostly uses the Win32 heap functions, which are implemented in user mode and divide up OS memory allocated via the virtual memory APIs. There are thus no context switches until and unless its current allocation of virtual memory is exhausted and it needs to go to the OS to allocate more.
There is a theoretical problem when allocating memory in that there is no upper bound on how long a heap traversal might take to find a free block. Practically, though, heap allocations are usually fast.
The exception is threaded applications. Because most C++ runtimes share a single heap between multiple threads, access to the heap needs to be serialized. This can severely degrade the performance of certain classes of applications that rely on multiple threads being able to new and delete many objects.
If you new or delete an address, it's marked as occupied or unassigned. The implementations do not talk to the kernel all the time: bigger chunks of memory are reserved and divided into smaller chunks in user space, within your application.
Because new and delete are re-entrant (or thread-safe, depending on the implementation), a context switch may occur, but the default new and delete are thread-safe anyway.
In C++ you are able to override the new and delete operators, e.g. to plug in your own memory management:
#include <cstdlib> //declarations of malloc and free
#include <new>
using namespace std;

class C {
public:
    C() {}
    void* operator new (size_t size);  //implicitly declared as a static member function
    void operator delete (void *p);    //implicitly declared as a static member function
};

void* C::operator new (size_t size) {
    void *p = malloc(size);
    if (p == 0) throw "allocation failure"; //instead of std::bad_alloc
    return p;
}

void C::operator delete (void *p) {
    free(p);
}

int main() {
    C *p = new C; // calls C::operator new
    delete p;     // calls C::operator delete
}

How does sbrk() work in C++?

Where can I read about sbrk() in some detail?
How does it exactly work?
In what situations would I want to use sbrk() instead of the cumbersome malloc() and new()?
btw, what is the expansion for sbrk()?
Have a look at the specification for brk/sbrk.
The call basically asks the OS to allocate some more memory for the application by incrementing the previous "break value" by a certain amount. This amount (the first parameter) is the amount of extra memory your application then gets.
Most rudimentary malloc implementations build upon the sbrk system call to get blocks of memory that they split up and track. The mmap function is generally accepted as a better choice (which is why mallocs like dlmalloc support both with an #ifdef).
As for "how it works", an sbrk at its simplest level could look something like this:
uintptr_t current_break; // Some global variable for your application.
                         // This would probably be properly tracked by the OS for the process

void *sbrk(intptr_t incr)
{
    uintptr_t old_break = current_break;
    current_break += incr;
    return (void *) old_break;
}
Modern operating systems would do far more, such as map pages into the address space and add tracking information for each block of memory allocated.
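To connect this to malloc: the simplest allocator you can build on such an sbrk is a bump allocator that only moves the break forward and never frees. This sketch backs the toy sbrk with a static arena so it is self-contained; toy_sbrk and bump_malloc are illustrative names, not real APIs.

```c
#include <stdint.h>
#include <stddef.h>

static uint8_t arena[1 << 16];          /* stands in for OS-managed memory */
static uint8_t *current_break = arena;  /* the toy "program break" */

static void *toy_sbrk(ptrdiff_t incr)
{
    uint8_t *old_break = current_break;
    if (current_break + incr > arena + sizeof arena)
        return (void *)-1;              /* out of memory: mimic sbrk failure */
    current_break += incr;
    return old_break;
}

/* A bump allocator: each request just advances the break. */
void *bump_malloc(size_t size)
{
    size = (size + 15) & ~(size_t)15;   /* keep 16-byte alignment */
    void *p = toy_sbrk((ptrdiff_t)size);
    return p == (void *)-1 ? NULL : p;
}
```

A real malloc adds headers, a free list, and block splitting/coalescing on top of exactly this kind of break-extension primitive.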
sbrk is pretty much obsolete; these days you'd use mmap to map some pages out of /dev/zero. It certainly isn't something you use instead of malloc and friends - it's more a way of implementing those. Also, of course, it exists only on POSIX-based operating systems that care about backwards compatibility with ancient code.
If you find malloc and new too cumbersome, you should look into garbage collection instead... but beware, there is a potential performance cost, so you need to understand what you are doing.
You never want to use sbrk instead of malloc or free. It is non-portable and is typically used only by implementers of the standard C library or in cases where it's not available. It's described pretty well in its man page:
Description
brk() sets the end of the data segment to the value specified by end_data_segment, when that value is reasonable, the system has enough memory, and the process does not exceed its maximum data size (see setrlimit(2)).
sbrk() increments the program's data space by increment bytes. sbrk() isn't a system call, it is just a C library wrapper. Calling sbrk() with an increment of 0 can be used to find the current location of the program break.
Return Value
On success, brk() returns zero, and sbrk() returns a pointer to the start of the new area. On error, -1 is returned, and errno is set to ENOMEM.
Finally, malloc and free are not cumbersome - they are the standard way to allocate and release memory in C. Even if you want to implement your own memory allocator, it's best to use malloc and free as the basis; a common approach is to allocate a large chunk at a time with malloc and parcel out memory from it (this is what suballocators, or pools, usually implement).
Re the origin of the name sbrk (or its cousin brk): it may come from the fact that the end of the heap is marked by a pointer known as the "break". The heap starts right after the BSS segment and typically grows up towards the stack.
You've tagged this C++, so why would you use 'cumbersome' malloc() rather than new? I am not sure what is cumbersome about malloc in any case - internally maybe, but why would you care? And if you did care (for reasons of determinism, for example), you could allocate a large pool and implement your own allocator for that pool. In C++, of course, you can overload the new operator to do that.
sbrk is used to glue the C library to the underlying OS's memory management, so make OS calls rather than using sbrk(). As to how it works, that is system dependent. If, for example, you are using the Newlib C library (commonly used on 'bare-metal' embedded systems with the GNU compiler), you have to implement sbrk yourself, so how it works in those circumstances is up to you, so long as it achieves its required behaviour of extending the heap or failing.
As you can see from the link, it does not do much and would be extremely cumbersome to use directly - you'd probably end-up wrapping it in all the functionality that malloc and new provide in any case.
This depends on what you mean by malloc being "cumbersome". sbrk is typically not used directly anymore unless you're implementing your own memory allocator (i.e. overriding operator new), and even then I'd possibly use malloc to obtain the initial memory.
If you'd like to see how to implement malloc() on top of sbrk(), check out http://web.ics.purdue.edu/~cs354/labs/lab6/ which is an exercise going through exactly that.
On a modern system you shouldn't touch this interface, though. Since you're calling malloc and new cumbersome, I suspect you don't yet have the requisite experience to safely and properly use sbrk in your code.