realloc() for NUMA Systems using HWLOC - c++

I have a several custom allocators that provide different means to allocate memory based on different policies. One of them allocates memory on a defined NUMA node. The interface to the allocator is straight-forward
template<typename config>
class NumaNodeStrategy
{
public:
static void *allocate(const size_t sz){}
static void *reallocate(void *old, size_t sz, size_t old_sz){}
static void deallocate(void *p, size_t sz){}
};
The allocation itself is handled using the hwloc_alloc_membind_nodeset() methods with the according parameters set for allocation policies etc. Howver, hwloc only provides methods for allocation and free'ing memory and I was wondering how should I implement reallocate().
Two possible solutions:
Allocate new memory area and memcpy() the data
Use hwloc_set_membind_nodeset() to set the memory allocation / binding policy for the nodeset and use plain malloc() / posix_memalign() and realloc().
Can anyone help me in getting this right?
Update:
I try to make the question more specific: Is there a possibility to perform a realloc() using hwloc without allocating new memory and moving the pages around?

To reply to the edit:
There's no realloc in hwloc, and we currently have no plan to add one. If you see preceisely what you want (C prototype of the function), feel free to add a ticket to https://svn.open-mpi.org/trac/hwloc
To reply to ogsx: The memory binding isn't specific, it's virtual memory area specific, and possibly thread-specific. If you realloc, the libc doesn't do anything special.
1) If it can realloc within the same page, you get memory on the same node. Good, but rare, especially for large buffers.
2) If it realloc in a different page (most of the cases for large buffers), it depends if the corresponding page have already been allocated in physical memory by the malloc lib in the past (malloc'ed and freed in virtual memory, but still allocated in physical memory)
2.a) If the virtual page has been allocated, it may have been allocated on another node for various reasons in the past, you're screwed.
2.b) If the new virtual page has not been allocated yet, the default is to allocate on the current node. If you specified a binding with set_area_membind() or mbind() earlier, it'll be allocated on the right node. You may be happy in this case.
In short, it depends on a lot of things. If you don't want to bother with the malloc lib doing complex/hidden internal things, and especially if your buffers are large, doing mmap(MAP_ANONYMOUS) instead of malloc is a simple way to be sure that pages are allocated when you really want them. And you even have mremap to do something similar to realloc.
alloc becomes mmap(length) + set_area_membind
realloc becomes mremap + set_area_membind (on the entire mremap'ed buffer)
Never used that but looks interesting.

The hwloc_set_area_membind_nodeset does the trick, doesn't it?
HWLOC_DECLSPEC int
hwloc_set_area_membind_nodeset (hwloc_topology_t topology,
const void *addr, size_t len, hwloc_const_nodeset_t nodeset,
hwloc_membind_policy_t policy, int flags)
Bind the already-allocated memory identified by (addr, len) to the NUMA node(s) in nodeset.
Returns:
-1 with errno set to ENOSYS if the action is not supported
-1 with errno set to EXDEV if the binding cannot be enforced
On linux, this call is implemented via mbind It works only if pages in the area was not touched, so it is just more correct way to move memory region in your second solution. UPDATE there is a MPOL_MF_MOVE* flags to move touched data.
The only syscall to move pages without reallocate-and-copy I know is move_pages
move_pages moves a set of pages in the address space of a executed process to a different NUMA node.

You're wrong. mbind can move pages that have been touched. You just need to add MPOL_MF_MOVE. That's what hwloc_set_area_membind_nodeset() does if you add the flag HWLOC_MEMBIND_MIGRATE.
move_pages is just a different way to do it (more flexible but a bit slower because you can move independant pages to different places). Both mbind with MPOL_MF_MOVE and move_pages (and migrate_pages) end up using the exact same migrate_pages() function in mm/migrate.c once they have converted the input into a list of pages.

Related

How to check how much memory has already been allocated by new()? [duplicate]

How to detect programmatically count of bytes allocated by process on Heap?
This test should work from process itself.
I think mallinfo() is what you want:
#include <malloc.h>
struct mallinfo *info;
info = mallinfo();
printf ("total allocated space: %llu bytes\n", info->uordblks);
printf ("total free space: %llu bytes\n", info->fordblks);
The struct mallinfo structure is technical, and specific to the malloc() implementation. But the information you want is in there. Here is how I report the values:
mallinfo.arena = "Total Size (bytes)"
mallinfo.uordblks = "Busy size (bytes)"
mallinfo.fordblks = "Free size (bytes)"
mallinfo.ordblks = "Free blocks (count)"
mallinfo.keepcost = "Top block size (bytes)"
mallinfo.hblks = "Blocks mapped via mmap() (count)"
mallinfo.hblkhd = "Bytes mapped via mmap() (bytes)"
These two are allegedly not used, but they seem to change on my system, and thus might be valid:
mallinfo.smblks = "Fast bin blocks (count)"
mallinfo.fsmblks = "Fast bin bytes (bytes)"
And the other interesting value is returned by "sbrk (0)"
There are a number of possibilities.
How accurate do you need it to be? You can get some useful data via cat /proc/${PID}/status | grep VmData.
You can #define your own malloc(), realloc(), calloc(), and free() functions, wrapping the real functions behind your own counter. You can do really cool things here with __FILE__, __LINE__, & __func__ to facilitate identifying core leaks in simple tests. But it will only instrument your own code!
(Similarly, you can also redefine the default operator new and operator delete methods, both array and non-array variants, and both throwing std::bad_alloc and std::nothrow_t variants. Again, this will only instrument your own code!)
(Be aware: On most C++ systems, new ultimately invokes malloc(). It doesn't have to. Especially with in-place new! But typically new does make use of malloc(). (Or it operates on a region of memory that has previously been malloc()'ed.) Otherwise you'd get into really funky stuff with multiple heap managers...)
You can use sbrk(0) to see where the data segment is currently set. That's not so great. It's a very coarse measurement, and it doesn't account for holes (unused memory regions) in the heap. (You're much better off with the VmData line from /proc/${PID}/status.) But if you're just looking for a general idea...
You can trap malloc()/free()/etc by writing your own shared library and forcing your process to use it instead of the real versions via LD_PRELOAD. You can use dlopen()/dlsym() to load & invoke the *real* malloc()/free()/etc. This works quite beautifully. The original code is unmodified, not even recompiled. But be aware of re-entrant situations when coding this library, and that your process will initially invoke malloc()/calloc()/realloc() before dlopen()/dlsym() can complete loading the real functions.
You might check out tools like Valgrind, though that's really aimed more at memory leaks.
Then again, perhaps mtrace() is what you want? Or __malloc_hook? Very proprietary (GNU) & nonstandard... But you are tagged "Linux"...
There's no easy, automatic way to do it, if that's what you're asking. You basically have to manually keep track of heap allocations yourself using a counter variable. The problem is that it's difficult to control which parts of your program are allocating memory on the heap, especially if you're using a lot of libraries out of your control. To complicate things further, there's two ways a program might allocate heap memory: new or malloc. (Not to mention direct OS calls like sbrk.)
You can override global operator new, and have each call to new increase a global tally. However, this won't necessarily include times when your program calls malloc, or when your program uses some class-specific new override. You can also override malloc using a macro, but this is not necessarily portable. And you'd also have to override all the variations of malloc, like realloc, calloc, etc. All of this is further complicated by the fact that on some implementations, new itself may call malloc.
So, essentially, it's very difficult to do this properly from within your program. I'd recommend using a memory profiler tool instead.
A speculative solution: redefine new and delete operators.
On each new operator call, a number of bytes to allocate is passed. Allocate a bit more memory and store the amount of bytes allocated within. Add this amount to global variable that holds the heap size.
On delete operator call, check the value you stored before you dispose the memory. Subtract it from that global variable.
If you're on windows, you can use GetProcessHeap(), HeapQueryInfo() to retrieve information about the processes heap. An example of walking the heap from MSDN
Since you've tagged your question 'linux' it might help to look at some of the information provided in the /proc directory. I haven't researched this a lot so I can only give you a starting point.
/proc/<your programs pid> contains files with some information about your process from the viewpoint of the kernel. There is a symlink /proc/self that will always about the process your investigating this from.
The files you might be most interested in are stat, statm and status. The latter is more human-readable, whereas the former two give the same info in a more machine-readable format.
A starting point about how to interpret the content of those files is available in the proc(5) manpage.
Other then keeping track of all your memory allocations I don't believe there is a way to calculate the heap size or usage
Use malloc_info(). Have fun with the XML aspect of it. mallinfo() (as shown in another answer) does a similar thing but is restricted to 32-bit values... a foolish thing to rely on in 2010+.

Implementing realloc in CUDA without moving data

According to this question and reference NVIDIA CUDA Programming Guide the realloc function is not implemented:
The CUDA in-kernel malloc() function allocates at least size bytes
from the device heap and returns a pointer to the allocated memory or
NULL if insufficient memory exists to fulfill the request. The
returned pointer is guaranteed to be aligned to a 16-byte boundary.
The CUDA in-kernel free() function deallocates the memory pointed to
by ptr, which must have been returned by a previous call to malloc().
If ptr is NULL, the call to free() is ignored. Repeated calls to
free() with the same ptr has undefined behavior.
I am currectly stuck with some portion of GMP library (or more strictly my attempt to port it on CUDA), which relies on this functionaliy:
__host__ __device__ static void * // generate this function for both CPU and GPU
gmp_default_realloc (void *old, size_t old_size, size_t new_size)
{
mp_ptr p;
#if __CUDA_ARCH__ // this directive separates device and host code
/* ? */
#else
p = (mp_ptr) realloc (old, new_size); /* host code has realloc from glibc */
#endif
if (!p)
gmp_die("gmp_default_realoc: Virtual memory exhausted.");
return p;
}
Essentially I can just simply call malloc with new_size, then call memcpy (or maybe memmove), then free previous block, but this requires obligatory moving of data (large arrays), which I would like to avoid.
Is there any effective efficient way to implement (standard C or C++) realloc function (i.e. inside kernel) ? Let's say that I have some large array of dynamically allocated data (already allocated by malloc), then in some other place realloc is invoked in order to request some larger amount of memory for that block. In short I would like to avoid copying whole data array into new location and I ask specifically how to do it (of course if it's possible at all).
I am not especially familiar with PTX ISA or underlying implementation of in-kernel heap functions, but maybe it's worth a look into that direction ?
Most malloc implementations over-allocate, this is the reason why realloc can sometimes avoid copying bytes - the old block may be large enough for the new size. But apparently in your environment the system malloc doesn't do that, so I think your only option is to reimplement all 3 primitives, gmp_default_{alloc,realloc,free} on top of the system-provided malloc/free.
There are many open-source malloc implementation out there, glibc has one you might be able to adapt.
I'm not familiar with CUDA or GMP, but off the top of my head:
gmp_malloc() followed by plain free() probably works on "normal" platforms, but will likely cause heap corruption if you go ahead with this
if all you want is a more efficient realloc, you can simply overallocate in your custom malloc (up to some size, say the nearest power of 2), just so you can avoid copying in the subseauent re-alloc. You don't even need a full-blown heap implementation for that.
your implementation may need to use a mutex or some such to protect your heap against concurrent modifications
you can improve performance even more if you never (or infrequently) return the malloc()ed blocks back to the OS from within your custom heap, I.e keep the gmp_free()ed blocks around for subsequent reuse instead of calling the system free() on them immediately
come to think of it, a better idea would be to introduce a sane malloc implementation into that platform, outside of your GMP lib, so that other programs and libraries could draw their memory from the same pool, instead of GMP doing one thing and everything else doing something else. This should help with the overall memory consumption w.r.t previous point. Maybe you should port glibc first :)

C++ using a Garbage Collector is overkill, what is a better solution?

I am currently using Boehm Garbage Collector for a large application in C++. While it works, it seems to me that the GC is overkill for my purpose (I do not like having this as a dependency and I have to continually make allowances and think about the GC in everything I do so as to not step on its toes). I would like to find a better solution that is more suited to my needs, rather than a blanket solution that happens to cover it.
In my situation I have one specific class (and everything that inherits from that class) that I want to "collect". I do not need general garbage collection, in all situations except for this particular class I can easily manage my own memory.
Before I started using the GC, I used reference counting, but reference cycles and the frequent updates made this a less than ideal solution.
Is there a better way for me to keep track of this class? One that does not involve additional library dependancies like boost.
Edit:
It is probably best if I give a rundown on the potential lifespan of my object(s).
A function creates a new instance of my class and may (or may not) use it. Regardless, it passes this new instance back to the caller as a return value. The caller may (or may not) use it as well, and again it passes it back up the stack, eventually getting to the top level function which just lets the pointer fade into oblivion.
I cannot just delete the pointer in the top level, because part of the "possible use", involves passing the pointer to other functions which may (or may not) store the pointer for use somewhere else, at some future time.
I hope this better illustrates the problem that I am trying to solve. I currently solve it with Boehm Garbage Collector, but would like simpler, non dependency involving, solution if possible.
In the Embedded Systems world, or programs that are real-time event critical, garbage collection is frowned upon. The point of using dynamic memory is bad.
With dynamic memory allocation, fragmentation occurs. A Garbage Collector is used to periodically arrange memory to reduce the fragmentation, such as combining sequential freed blocks. The primary issue is when to perform this defragmentation or running of the GC.
Some suggested alternatives:
Redesign your system to avoid dynamic memory allocation.
Allocate static buffers and use them. For example in an RTOS system, preallocate space for messages, rather than dynamically allocating them.
Use the Stack, not the Heap.
Use the stack for dynamically allocated variables, if possible. This is not a good idea if variables need a lifetime beyond the function execution.
Place limits on variable sized data.
Along with static buffers, place limits on variable length data or incoming data of unknown size. This may mean that the incoming data must be paused or multiple buffering when the input cannot be stopped.
Create your own memory allocator.
Create many memory pools that allocate different sized blocks. This will reduce fragmentation. For example, for small blocks, maybe a bitset could be used to determine which bytes are in use and which are available. Maybe another pool for 64 byte blocks is necessary. All depends on your system's needs.
If you really just need special handling for the memory allocations associated with a single class, then you should look at overloading the new operator for that class.
class MyClass
{
public:
void *operator new(size_t);
void operator delete(void *);
}
You can implement these operators to do whatever you need to track the memory: allocate it from a special pool, place references on a linked list for tracking, etc.
void* MyClass::operator new(size_t size)
{
void *p = my_allocator(size); // e.g., instead of malloc()
// place p on a linked list, etc.
return p;
}
void MyClass::operator delete(void *p)
{
// remove p from list...
my_free(p);
}
You can then write external code that can walk through the list you are keeping to inspect every currently-allocated instance of MyClass, GC'ing instances as appropriate for your situation.
With memory, you should always try and have clear ownership and knowledge of lifetime. Lifetime determines where you take the memory from (as do other factors), ie stack for scope lived, pool for reused, etc. Ownership will tell you when and if to free memory. In your case, the GC has the ownership and makes the decision when to free. With ref counting, the wrapper class does this logic. Unclear ownership leads to hard to maintain code if manual memory management is used. You must avoid use after free, double frees, and memory leaking.
To solve your problem, figure out who should keep ownership. This will dictate the algoritm to use. GC and ref counting are popular choices, but there are infinetly many. If ownership is unclear, give it to a 3rd party whose job it is to keep track of it. If ownership is shared, make sure all parties are aware of it perhaps by enforcing it via specialized classes. This can also be enforced by simple convention, ie objects of type foo should never keep ptrs of type bar internally as they do not own them and if they do they cannot assume them always valid and might have to check for validity first. Etc.
If you find this hard to determine, it could be a sign that the code is very complex. Could it be made in a more simple manner?
Understanding how your memory is used and accessed is key to writing clean code for maintenance and performance optimizations. This is true regardless of language used.
Best of luck.

New vs. Malloc, when overloading New

I'm overloading new and delete to implement my own small-objects/thread-safe allocator.
The problem is that when I am overloading new, I cannot use new without breaking universal causality or at least the compiler. Most examples I found where new is overloaded, use Malloc() to do the actual allocation. But from what I understood of C++, there is no use-case for Malloc() at all.
Multiple answers similar to this one, some with less tort outside of SO: In what cases do I use malloc vs new?
My question, is how do I allocate the actual memory when overloading operator new without using Malloc() ?
(This is out of curiosity more than anything, try not to take the reasoning behind the overload too seriously; I have a seperate question out on that anywho!)
Short answer: if you don't want existing malloc, you need to implement your own heap manager.
A heap manager, for example malloc in glibc of Linux, HeapAlloc in Windows, is a user-level algorithm. First, keep in mind that heap is optimized for allocating small sizes of objects like 4~512 bytes.
How to implement your own heap manager? At least, you must call a system API that allocates a memory chunk in your process. There are VirtualAlloc for Windows and sbrk for Linux. These APIs allocate a large chunk of memory, but the size must be multiple of page size. Typically, the size of page in x86 and Windows/Linux is 4KB.
After obtaining a chunk of page, you need to implement your own algorithms how to chop down this big memory into smaller requests. A classic (still very practical) implementation and algorithm is dlmalloc: http://g.oswego.edu/dl/html/malloc.html
To implement, you need to have several data structures for book-keeping and a number of policies for optimization. For example, for small objects like 16, 20, 36, 256 bytes, a heap manager maintains a list of blocks of each size. So, there are a list of lists. If requested size is bigger than a page size, then it just call VirtualAlloc or sbrk. However, an efficient implementation is very challenging. You must consider not only speed and space overhead, but also cache locality and fragmentation.
If you are interested in heap managers optimized for multithreaded environment, take a look a tcmalloc: http://goog-perftools.sourceforge.net/doc/tcmalloc.html
I see no problem in calling malloc() inside a new overload, just make sure you overload delete so it calls free(). But if you really don't want to call malloc(), one way is to just allocate enough memory another way:
class A {
public:
/* ... */
static void* operator new (size_t size) {
return (void *)new unsigned char[size];
}
static void operator delete (void *p) {
delete[]((unsigned char *)p);
}
/* ... */
};

C++ Using the new operator efficiently

When instantiating a class with new. Instead of deleting the memory what kinds of benefits would we gain based on the reuse of the objects?
What is the process of new? Does a context switch occur? New memory is allocated, who is doing the allocation? OS ?
You've asked a few questions here...
Instead of deleting the memory what kinds of benefits would we gain based on the reuse of the objects?
That depends entirely on your application. Even supposing I knew what the application is, you've left another detail unspecified -- what is the strategy behind your re-use? But even knowing that, it's very hard to predict or answer generically. Try some things and measure them.
As a rule of thumb I like to minimize the most gratuitous of allocations. This is mostly premature optimization, though. It'd only make a difference over thousands of calls.
What is the process of new?
Entirely implementation dependent. But the general strategy that allocators use is to have a free list, that is, a list of blocks which have been freed in the process. When the free list is empty or contains insufficient contiguous free space, it must ask the kernel for the memory, which it can only give out in blocks of a constant page size. (4096 on x86.) An allocator also has to decide when to chop up, pad, or coalesce blocks. Multi-threading can also put pressure on allocators because they must synchronize their free lists.
Generally it's a pretty expensive operation. Maybe not so much relative to what else you're doing. But it ain't cheap.
Does a context switch occur?Entirely possible. It's also possible that it won't. Your OS is free to do a context switch any time it gets an interrupt or a syscall, so uh... That can happen at a lot of times; I don't see any special relationship between this and your allocator.
New memory is allocated, who is doing the allocation? OS ?It might come from a free list, in which case there is no system call involved, hence no help from the OS. But it might come from the OS if the free list can't satisfy the request. Also, even if it comes from the free list, your kernel might have paged out that data, so you could get a page fault on access and the kernel's allocator would kick in. So I guess it'd be a mixed bag. Of course, you can have a conforming implementation that does all kinds of crazy things.
new allocates memory for the class on the heap, and calls the constructor.
context switches do not have to occur.
The c++-runtime allocates the memory on its freestore using whatever mechanism it deems fit.
Usually the c++ runtime allocates large blocks of memory using OS memory management functions, and then subdivides those up using its own heap implementation. The microsoft c++ runtime mostly uses the Win32 heap functions which are implemented in usermode, and divide up OS memory allocated using the virtual memory apis. There are thus no context switches until and unless its current allocation of virtual memory is needed and it needs to go to the OS to allocate more.
There is a theoretical problem when allocating memory that there is no upper bound on how long a heap traversal might take to find a free block. Practically tho, heap allocations are usually fast.
With the exception of threaded applications. Because most c++ runtimes share a single heap between multiple threads, access to the heap needs to be serialized. This can severly degrade the performance of certain classes of applications that rely on multiple threads being able to new and delete many objects.
If you new or delete an address it's marked as occupied or unassigned. The implementations do not talk all the time with the kernel. Bigger chucks of memory are reserved and divided in smaller chucks in user space within your application.
Because new and delete are re-entrant (or thread-safe depending on the implementation) a context switch may occur but your implementation is thread-safe anyway while using the default new and delete.
In C++ you are able to overwrite the new and delete operator, e.g. to place your memory management:
#include <cstdlib> //declarations of malloc and free
#include <new>
#include <iostream>
using namespace std;
class C {
public:
C();
void* operator new (size_t size); //implicitly declared as a static member function
void operator delete (void *p); //implicitly declared as a static member function
};
void* C::operator new (size_t size) throw (const char *){
void * p = malloc(size);
if (p == 0) throw "allocation failure"; //instead of std::bad_alloc
return p;
}
void C::operator delete (void *p){
C* pc = static_cast<C*>(p);
free(p);
}
int main() {
C *p = new C; // calls C::new
delete p; // calls C::delete
}