New vs. Malloc, when overloading New


I'm overloading new and delete to implement my own small-objects/thread-safe allocator.
The problem is that when I am overloading new, I cannot use new inside it without breaking universal causality, or at least the compiler. Most examples I found where new is overloaded use malloc() to do the actual allocation. But from what I understood of C++, there is no use case for malloc() at all.
There are multiple answers similar to this one, some less tactful, outside of SO: In what cases do I use malloc vs new?
My question is: how do I allocate the actual memory when overloading operator new without using malloc()?
(This is out of curiosity more than anything, so try not to take the reasoning behind the overload too seriously; I have a separate question out on that anywho!)

Short answer: if you don't want existing malloc, you need to implement your own heap manager.
A heap manager (for example, malloc in glibc on Linux, or HeapAlloc on Windows) is a user-level algorithm. First, keep in mind that a heap is optimized for allocating small objects, e.g. 4 to 512 bytes.
How do you implement your own heap manager? At the very least, you must call a system API that gives your process a chunk of memory: VirtualAlloc on Windows, sbrk (or mmap) on Linux. These APIs hand out large chunks of memory in whole pages, so the sizes are effectively multiples of the page size. Typically, the page size on x86 Windows/Linux is 4 KB.
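As a rough POSIX sketch of that first step (assuming Linux or another POSIX system; on Windows the analogous call is VirtualAlloc), a heap manager can round each request up to whole pages and map anonymous memory. The helper names `round_to_pages`, `map_pages` and `unmap_pages` are hypothetical, and mmap is used here instead of sbrk because it is the simpler of the two to demonstrate.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>  // mmap, munmap (POSIX)
#include <unistd.h>    // sysconf

// Hypothetical helper: round a byte count up to a whole number of pages.
inline std::size_t round_to_pages(std::size_t bytes) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    return (bytes + page - 1) / page * page;
}

// Hypothetical helper: get a zero-filled, page-aligned chunk straight from the kernel.
// Returns nullptr on failure (a real operator new would throw std::bad_alloc instead).
inline void* map_pages(std::size_t bytes) {
    void* p = mmap(nullptr, round_to_pages(bytes), PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

// Hypothetical helper: give the chunk back; pass the same byte count given to map_pages.
inline void unmap_pages(void* p, std::size_t bytes) {
    int rc = munmap(p, round_to_pages(bytes));
    assert(rc == 0);
    (void)rc;  // silence "unused variable" warnings when NDEBUG disables assert
}
```

Everything a general-purpose heap does (size classes, free lists, coalescing) is then built on top of chunks obtained this way.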
After obtaining a chunk of pages, you need to implement your own algorithms for chopping this big memory into smaller pieces for individual requests. A classic (and still very practical) implementation and algorithm is dlmalloc: http://g.oswego.edu/dl/html/malloc.html
To implement it, you need several data structures for bookkeeping and a number of policies for optimization. For example, for small objects like 16, 20, 36, or 256 bytes, a heap manager maintains a list of blocks of each size, so there is a list of lists. If the requested size is bigger than a page, it just calls VirtualAlloc or sbrk directly. However, an efficient implementation is very challenging: you must consider not only speed and space overhead, but also cache locality and fragmentation.
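A toy version of the "list of lists" idea, under several simplifying assumptions: fixed size classes of 16, 32, ..., 256 bytes, blocks carved from 4 KB pages, and a static buffer standing in for the pages you would get from VirtualAlloc or sbrk. The class `SizeClassHeap` is hypothetical; it is not thread-safe, has no large-object path and never returns pages, so treat it as an illustration rather than a usable allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <new>  // placement new

// Hypothetical toy heap: one singly linked free list per size class.
class SizeClassHeap {
public:
    static constexpr std::size_t kPage = 4096;
    static constexpr std::size_t kGranule = 16;   // size classes: 16, 32, ..., 256 bytes
    static constexpr std::size_t kClasses = 16;

    void* allocate(std::size_t n) {
        if (n == 0) n = 1;
        if (n > kClasses * kGranule) return nullptr;  // large requests would go to the OS (not shown)
        const std::size_t c = (n - 1) / kGranule;     // index of the smallest class that fits
        if (!free_[c] && !refill(c)) return nullptr;
        Node* b = free_[c];
        free_[c] = b->next;                           // pop the head of that class's list
        return b;
    }

    // The caller passes the size it allocated with, which identifies the list.
    void deallocate(void* p, std::size_t n) {
        if (!p) return;
        if (n == 0) n = 1;
        assert(n <= kClasses * kGranule);
        const std::size_t c = (n - 1) / kGranule;
        free_[c] = new (p) Node{free_[c]};            // push back onto that class's list
    }

private:
    struct Node { Node* next; };

    // Carve one fresh page into equally sized blocks for class c.
    bool refill(std::size_t c) {
        if (used_ + kPage > sizeof(arena_)) return false;  // the stand-in "OS memory" is used up
        unsigned char* page = arena_ + used_;
        used_ += kPage;
        const std::size_t sz = (c + 1) * kGranule;
        for (std::size_t off = 0; off + sz <= kPage; off += sz)
            free_[c] = new (page + off) Node{free_[c]};
        return true;
    }

    alignas(16) unsigned char arena_[16 * kPage];  // stand-in for pages from VirtualAlloc/sbrk
    std::size_t used_ = 0;
    Node* free_[kClasses] = {};
};
```

Freed blocks are reused last-in, first-out within their class, which is cheap and tends to be cache-friendly; real allocators add coalescing, per-thread caches and a path back to the OS on top of this.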
If you are interested in heap managers optimized for multithreaded environments, take a look at tcmalloc: http://goog-perftools.sourceforge.net/doc/tcmalloc.html

I see no problem in calling malloc() inside a new overload, just make sure you overload delete so it calls free(). But if you really don't want to call malloc(), one way is to just allocate enough memory another way:
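#include <cstddef>

class A {
public:
    /* ... */
    static void* operator new(std::size_t size) {
        // Raw storage from the global operator new[], no malloc() involved.
        return new unsigned char[size];
    }
    static void operator delete(void* p) {
        // Must match the new[] above.
        delete[] static_cast<unsigned char*>(p);
    }
    /* ... */
};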

Related

Implementing realloc in CUDA without moving data

According to this question and the NVIDIA CUDA Programming Guide, the realloc function is not implemented in device code:
The CUDA in-kernel malloc() function allocates at least size bytes from the device heap and returns a pointer to the allocated memory or NULL if insufficient memory exists to fulfill the request. The returned pointer is guaranteed to be aligned to a 16-byte boundary. The CUDA in-kernel free() function deallocates the memory pointed to by ptr, which must have been returned by a previous call to malloc(). If ptr is NULL, the call to free() is ignored. Repeated calls to free() with the same ptr has undefined behavior.
I am currently stuck with some portion of the GMP library (or, more strictly, my attempt to port it to CUDA), which relies on this functionality:
__host__ __device__ static void * // generate this function for both CPU and GPU
gmp_default_realloc (void *old, size_t old_size, size_t new_size)
{
    mp_ptr p;
#if __CUDA_ARCH__ // this directive separates device and host code
    /* ? */
#else
    p = (mp_ptr) realloc (old, new_size); /* host code has realloc from glibc */
#endif
    if (!p)
        gmp_die("gmp_default_realloc: Virtual memory exhausted.");
    return p;
}
Essentially, I could simply call malloc with new_size, then memcpy (or maybe memmove) the data, then free the previous block, but this always requires moving the data (large arrays), which I would like to avoid.
Is there any efficient way to implement the (standard C or C++) realloc function inside a kernel? Let's say that I have a large, dynamically allocated array (already allocated by malloc), and somewhere else realloc is invoked to request a larger amount of memory for that block. In short, I would like to avoid copying the whole array to a new location, and I am asking specifically how to do that (if it's possible at all).
I am not especially familiar with the PTX ISA or the underlying implementation of the in-kernel heap functions, but maybe it's worth looking in that direction?

Most malloc implementations over-allocate; this is why realloc can sometimes avoid copying bytes: the old block may already be large enough for the new size. But apparently the system malloc in your environment doesn't do that, so I think your only option is to reimplement all three primitives, gmp_default_{alloc,realloc,free}, on top of the system-provided malloc/free.
There are many open-source malloc implementations out there; glibc has one you might be able to adapt.
I'm not familiar with CUDA or GMP, but off the top of my head:
gmp_malloc() followed by plain free() probably works on "normal" platforms, but will likely cause heap corruption if you go ahead with this, so every block must be released through the matching gmp free function
if all you want is a more efficient realloc, you can simply overallocate in your custom malloc (up to some size, say the nearest power of 2), just so you can avoid copying in the subsequent realloc. You don't even need a full-blown heap implementation for that.
your implementation may need to use a mutex or some such to protect your heap against concurrent modifications
you can improve performance even more if you never (or infrequently) return the malloc()ed blocks back to the OS from within your custom heap, i.e. keep the gmp_free()d blocks around for subsequent reuse instead of calling the system free() on them immediately
come to think of it, a better idea would be to introduce a sane malloc implementation into that platform, outside of your GMP lib, so that other programs and libraries could draw their memory from the same pool, instead of GMP doing one thing and everything else doing something else. This should help with the overall memory consumption with respect to the previous point. Maybe you should port glibc first :)
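A host-side C++ sketch of the over-allocation idea: round each capacity up to a power of two, keep that capacity in a small header in front of the block, and copy only when a request outgrows it. The names `pow2_alloc`, `pow2_realloc` and `pow2_free` are hypothetical; the sketch assumes the underlying malloc returns 16-byte-aligned memory (which the quoted guide guarantees for the in-kernel malloc) and has only been reasoned about for host code, so porting it to a `__device__` function is left as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

namespace {
constexpr std::size_t kHeader = 16;  // header size; keeps user pointers 16-byte aligned

// Smallest power of two >= n (minimum 16).
std::size_t round_up_pow2(std::size_t n) {
    std::size_t c = 16;
    while (c < n) c <<= 1;
    return c;
}

// Read the capacity stored in the header in front of a user pointer.
std::size_t capacity_of(const void* user) {
    std::size_t cap;
    std::memcpy(&cap, static_cast<const unsigned char*>(user) - kHeader, sizeof cap);
    return cap;
}
}  // namespace

// Hypothetical: allocate at least n bytes; the real capacity is a power of two.
void* pow2_alloc(std::size_t n) {
    if (n > (static_cast<std::size_t>(-1) >> 2)) return nullptr;  // avoid overflow in rounding
    const std::size_t cap = round_up_pow2(n);
    unsigned char* raw = static_cast<unsigned char*>(std::malloc(kHeader + cap));
    if (!raw) return nullptr;
    std::memcpy(raw, &cap, sizeof cap);  // remember the capacity in the header
    return raw + kHeader;
}

// Hypothetical: release a block obtained from pow2_alloc/pow2_realloc.
void pow2_free(void* user) {
    if (user) std::free(static_cast<unsigned char*>(user) - kHeader);
}

// Hypothetical, same shape as gmp_default_realloc: copies only when the slack is not enough.
void* pow2_realloc(void* old, std::size_t old_size, std::size_t new_size) {
    if (!old) return pow2_alloc(new_size);
    if (new_size <= capacity_of(old)) return old;  // fits: no copy, same pointer
    void* fresh = pow2_alloc(new_size);
    if (!fresh) return nullptr;                    // like realloc, the old block stays valid
    std::memcpy(fresh, old, old_size);             // old_size <= old capacity < new_size
    pow2_free(old);
    return fresh;
}
```

Growing an array one element at a time then copies only O(log n) times, at the price of up to 2x memory per block; as the list above says, blocks from these functions must never be passed to the plain free().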

C++ memory management and Misra

I need some clarification about C++ memory management and the MISRA guidelines.
I have to implement a program that is MISRA compliant, so I have to respect an important rule: it is not possible to use the 'new' operator (dynamic heap memory).
In this case, for any custom object, I must use static allocation.
For example:
I have a class Student with a constructor Student(int age).
Whenever I have to instantiate a Student object, I must do it this way:
int theAge = 18;
Student exampleOfStudent(theAge);
This creates a Student object, exampleOfStudent.
This way I do not have to worry about destructors.
Is all of this correct?
Are there other ways to use static memory management?
Can I use std::vector or other data structures in the same way?
Can I add, for example, a Student instance (that I created as Student exampleOfStudent(theAge)) to a std::vector?

Student exampleOfStudent(theAge); is an automatic variable, not static.
As far as I remember, MISRA rules disallow all forms of dynamic memory. This includes both malloc and new and std::vector (with the default allocator).
You are left with only automatic variables and static variables.
If your system has a limited amount of RAM you don't want to use dynamic memory because of the risk you will ask for more memory than is available. Heap fragmentation is also an issue. This prevents you from writing provably correct code. If you use variables with automatic or static storage a static analysis application can, for instance, output the maximum amount of memory your application will use. This number you can check against your system RAM.

The idea behind the rule is not that malloc and new, specifically, are unsafe, but that memory allocation is (usually) a lazy workaround for not understanding, or managing, the memory requirements of your program.
A few, of many, strategies that avoid the assumption that you have infinite memory (or even much memory, in that inexpensive embedded system) and force you to deal with the faults that might be important in your application:
pre-allocating your calculated maximum input, and trapping overruns
providing a packet, stream, or other line-oriented means of managing input
using an alternative pre-allocated data structure to manage non-uniform elements
Particularly in the context of a small, non-MMU embedded system, that lack of design depth frequently leads to an unstable system that crashes outright in those odd "corner case" exceptions. Small memory and a short stack are a system killer.
Don't write your own malloc.
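One way to follow the first strategy ("pre-allocating your calculated maximum input, and trapping overruns") in C++, which also partly answers the std::vector question: a fixed-capacity array whose storage lives inside the object, so a static or automatic instance never touches the heap. `FixedVector`, `Reading` and the capacity of 4 are hypothetical, and this sketch has not been checked against any particular MISRA rule set. It assumes T is default-constructible and copy-assignable.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-capacity sequence: storage is a member array, no dynamic allocation.
template <typename T, std::size_t N>
class FixedVector {
public:
    // Returns false (the "trapped overrun") instead of allocating more memory.
    bool push_back(const T& value) {
        if (size_ == N) return false;
        items_[size_++] = value;
        return true;
    }
    std::size_t size() const { return size_; }
    static constexpr std::size_t capacity() { return N; }
    const T& operator[](std::size_t i) const { assert(i < size_); return items_[i]; }

private:
    std::array<T, N> items_{};  // the calculated worst case, reserved up front
    std::size_t size_ = 0;
};

struct Reading { int sensor; int value; };  // hypothetical element type

// Hypothetical buffer sized for the calculated maximum input, in static storage.
static FixedVector<Reading, 4> g_readings;
```

Unlike std::vector, the caller has to handle a full buffer explicitly, which is exactly the fault-handling decision the strategies above are meant to force.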

For MISRA compliance, placement-new is not a problem, as there is no dynamic allocation happening.
A library could be written (like an STL allocator) in such a way as to reference a statically allocated memory region as its memory pool for such a purpose.
Advantages: deterministic, fast.
Disadvantages: memory inefficient.
A favorable trade off for deterministic real-time systems.
All needed RAM has to be there at program startup, or the program won't run.
If the program starts, it's unaffected by available heap size, fragmentation, etc.
Writing one's own allocator can be complex, and out-of-memory conditions (the static memory pool size is fixed, after all) still have to be dealt with.
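A minimal sketch of the placement-new approach: a statically allocated, correctly aligned buffer holds a fixed number of objects, placement new constructs them, and destructors are called explicitly. `StaticPool` is hypothetical, and `Student` is borrowed from the question with a hypothetical `age()` accessor added. Exhaustion is reported by returning nullptr, since the pool size is fixed.

```cpp
#include <cassert>
#include <cstddef>
#include <new>      // placement new
#include <utility>  // std::forward

class Student {
public:
    explicit Student(int age) : age_(age) {}
    int age() const { return age_; }  // hypothetical accessor
private:
    int age_;
};

// Hypothetical pool: N slots for T in static storage, no heap involved.
template <typename T, std::size_t N>
class StaticPool {
public:
    template <typename... Args>
    T* create(Args&&... args) {
        for (std::size_t i = 0; i < N; ++i) {
            if (!used_[i]) {
                used_[i] = true;
                return new (&slots_[i]) T(std::forward<Args>(args)...);  // construct in place
            }
        }
        return nullptr;  // pool exhausted: the fixed size is part of the design
    }

    void destroy(T* p) {
        for (std::size_t i = 0; i < N; ++i) {
            if (used_[i] && static_cast<void*>(&slots_[i]) == static_cast<void*>(p)) {
                p->~T();         // explicit destructor call; the slot stays in the pool
                used_[i] = false;
                return;
            }
        }
        assert(p == nullptr);    // anything else was not created by this pool
    }

private:
    struct Slot { alignas(T) unsigned char bytes[sizeof(T)]; };
    Slot slots_[N];
    bool used_[N] = {};
};

static StaticPool<Student, 2> g_students;  // the whole pool exists before main() runs
```

All the RAM the pool needs is visible at link time, which matches the "deterministic, fast, memory inefficient" trade-off described above.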

I once wrote a library that had to comply with the MISRA rules. I needed dynamic memory as well, so I came up with a trick:
My lib was written in C, but my trick may work for you.
Part of the header-file looked like this:
/* declare two function pointers compatible to malloc and free: */
typedef void * (*allocatorFunc)(size_t size);
typedef void (*freeFunc) (void * data);
/* and let the library user pass them during lib-init: */
int library_init (allocatorFunc allocator, freeFunc deallocator);
Inside the library I never called malloc/free directly; I always used the supplied function pointers. So I delegated the question of what the dynamic memory allocation should look like to someone else.
The customer actually liked this solution. He was aware that my library would not work without dynamic memory allocation, and it gave him the freedom to implement his own memory scheme using preallocated pools or whatnot.
In C++ you can do the same: just use the supplied malloc-like function and do the object creation using placement new.
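A C++ rendering of that trick, assuming a hypothetical `lib` namespace whose `library_init` stores the user's allocator and deallocator; objects are created in the supplied memory with placement new and destroyed explicitly. It also assumes the user's allocator returns memory suitably aligned for every type the library creates, and that constructors do not throw.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>      // placement new
#include <utility>  // std::forward

// Same function-pointer types as the C header above.
typedef void* (*allocatorFunc)(std::size_t size);
typedef void (*freeFunc)(void* data);

namespace lib {  // hypothetical library namespace

allocatorFunc g_alloc = nullptr;
freeFunc g_free = nullptr;

// The library user decides where memory comes from (pool, static buffer, malloc...).
int library_init(allocatorFunc allocator, freeFunc deallocator) {
    if (!allocator || !deallocator) return -1;
    g_alloc = allocator;
    g_free = deallocator;
    return 0;
}

// Create a T in memory from the user's allocator (assumes suitable alignment,
// and that T's constructor does not throw).
template <typename T, typename... Args>
T* create(Args&&... args) {
    assert(g_alloc && "library_init() must be called first");
    void* mem = g_alloc(sizeof(T));
    if (!mem) return nullptr;  // out of memory is reported, not thrown
    return new (mem) T(std::forward<Args>(args)...);
}

// Destroy explicitly, then hand the raw memory back to the user's deallocator.
template <typename T>
void destroy(T* p) {
    if (!p) return;
    p->~T();
    g_free(p);
}

}  // namespace lib
```

A caller can pass a pair of lambdas wrapping std::malloc/std::free during development and swap in a pool allocator for the certified build without touching the library.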

using a vector in a custom memory manager

I am writing a memory manager in C++. The aim is to allocate a set amount of memory at the start using malloc and then overload new and delete so that they use that memory. I almost have it working; my only problem is how I am keeping track of what is where in the memory.
I created a vector of structs which holds information such as size, location, and whether the block is free.
The problem is that when I call push_back, it attempts to use my overloaded new function. This is where it fails, because my overloaded new can't work until the first information structure has been pushed back.
Does anyone know how i can resolve this or a better way to keep track of the memory?

Don't overload global operator new!
The easiest and (WARNING; subjective ->) best solution would be to define your own Allocator, which you'll use when dealing with allocation on the free store (aka the heap). All STL containers support passing an allocator type as a template argument.
Overloading global operator new/operator delete might seem like a neat solution, but I can almost guarantee you that it will cause you troubles as the developing goes by.
Inside this custom-made allocator you can keep track of what goes where, but make sure the internal std::vector (or whatever you'd like to use; a std::map seems more fitting to me) uses the default operator new/operator delete.
How do I create my own allocator?
The link below will lead you to a nice document with information regarding this matter:
stdcxx.apache.org - Building Your Own Allocators (heavily recommended)
Using a custom allocator when required/wanted means you will not run into any chicken-and-egg problem when trying to allocate memory for the allocator that will allocate memory, but the allocator must have allocated memory to use the allocator methods... and what will allocate memory for the allocator but the allocator? Well, we will need to allocate memory for that allocator, and that allocator must have its own allocator, though that allocator needs memory, provided by another allocator?
Maybe I should just get myself a dog instead, they don't lay eggs - right?

Create a class and overload new only in this class. You will not have problems with your vector. You will be able to use your own new with new A and the normal (global) new with ::new A.
class C
{
public:
void* operator new( size_t n ) ;
// ...
} ;
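A small sketch of the class-scoped approach, with a hypothetical counter added only to make the dispatch visible: `new C` picks C::operator new, `::new C` skips it and uses the global one, and a std::vector<int> is unaffected because its allocations never involve C. Delegating to `::operator new` is a placeholder for whatever your own pool does.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

class C {
public:
    static int class_news;  // hypothetical counter: calls to the class-specific operator new

    static void* operator new(std::size_t n) {
        ++class_news;
        return ::operator new(n);  // placeholder: your own memory manager would go here
    }
    static void operator delete(void* p) noexcept { ::operator delete(p); }

    int value = 0;
};
int C::class_news = 0;
```

Objects created with `::new` must also be released with `::delete`, so that the global operator delete is used.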
Otherwise, you can use your own allocation function rather than overloading operator new.
A basic idea of an allocator:
int *i = static_cast<int *>(myGetMem(sizeof *i)); // myGetMem() allocates sizeof(*i) bytes of memory
This way you will not have problems using the vector.
In fact, a real memory allocator keeps the information you put in the vector inside the allocated memory itself.
You can take an algorithm for getmem/freemem and adapt it to your case; it can be helpful.
E.g.: I want to allocate 10 bytes. The memory at 0x1024 contains information about the allocated memory, and the allocator returns an address after 0x1024, maybe 0x1030 (depending on the information stored), as the start of the allocated memory. So the user gets address 0x1030 and owns the memory from 0x1030 up to (but not including) 0x103A.
When the deallocator is called, the information at the beginning is used to free the memory correctly and to put it back in the list of available memory.
(In popular algorithms, the list of available memory is stored as an array of linked lists of free blocks organized by size, with algorithms to avoid and minimize fragmentation.)
This can remove your need for the vector.
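To make the "information stored just before the returned address" idea concrete, here is a sketch under stated assumptions: one 4 KB static arena, a 16-byte header in front of each block, first-fit reuse of freed blocks, and no splitting or coalescing. `BlockHeader` and `HeaderHeap` are hypothetical names; no separate vector is needed because each block describes itself.

```cpp
#include <cassert>
#include <cstddef>
#include <new>  // placement new

// Hypothetical bookkeeping stored in front of every block.
struct BlockHeader {
    std::size_t size;   // usable bytes after the header
    BlockHeader* next;  // free-list link, used only while the block is free
};

class HeaderHeap {
public:
    void* allocate(std::size_t n) {
        if (n > sizeof(arena_)) return nullptr;
        n = (n + 15) / 16 * 16;  // round up so every block stays 16-byte aligned
        // 1) First fit: reuse a freed block that is large enough.
        for (BlockHeader** link = &free_; *link; link = &(*link)->next) {
            if ((*link)->size >= n) {
                BlockHeader* h = *link;
                *link = h->next;  // unlink it from the free list
                return reinterpret_cast<unsigned char*>(h) + kHeader;
            }
        }
        // 2) Otherwise carve a new block from the untouched end of the arena.
        if (kHeader + n > sizeof(arena_) - used_) return nullptr;
        BlockHeader* h = new (arena_ + used_) BlockHeader{n, nullptr};
        used_ += kHeader + n;
        return reinterpret_cast<unsigned char*>(h) + kHeader;
    }

    void deallocate(void* p) {
        if (!p) return;
        // The header sits at a fixed offset before the pointer the user got.
        BlockHeader* h = reinterpret_cast<BlockHeader*>(static_cast<unsigned char*>(p) - kHeader);
        h->next = free_;
        free_ = h;  // no splitting or coalescing in this sketch
    }

private:
    static constexpr std::size_t kHeader = 16;
    static_assert(sizeof(BlockHeader) <= kHeader, "header must fit in front of the block");
    alignas(16) unsigned char arena_[4096];
    std::size_t used_ = 0;
    BlockHeader* free_ = nullptr;
};
```

Here a 10-byte request costs 32 bytes of arena (16 for the header, 16 for the rounded block), which mirrors the 0x1024/0x1030 example above.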

You can create a vector using any custom allocator.
It is declared in the following manner:
std::vector<YourStruct, YourOwnAllocator> memory_allocations;
YourOwnAllocator is going to be a class which will allocate the data needed for the vector bypassing your overloaded operators.
It needs to provide all the methods and typedefs listed in the standard's Allocator requirements.
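A minimal C++11-style allocator, assuming the only goal is to keep the bookkeeping vector away from an overloaded global operator new by getting raw memory straight from std::malloc. `MallocAllocator`, `BlockInfo` and `BlockTable` are hypothetical names. Since C++11, std::allocator_traits fills in most of the other members, so only value_type, allocate, deallocate, a converting constructor and the comparisons are written out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical allocator that bypasses any overloaded operator new.
template <typename T>
struct MallocAllocator {
    using value_type = T;

    MallocAllocator() = default;
    template <typename U>
    MallocAllocator(const MallocAllocator<U>&) noexcept {}  // needed for rebinding

    T* allocate(std::size_t n) {
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_array_new_length();
        void* p = std::malloc(n * sizeof(T));  // raw memory, no operator new involved
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

// Stateless, so any two instances can free each other's memory.
template <typename T, typename U>
bool operator==(const MallocAllocator<T>&, const MallocAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const MallocAllocator<T>&, const MallocAllocator<U>&) { return false; }

// Hypothetical bookkeeping record from the question: location, size, free or not.
struct BlockInfo {
    void* location;
    std::size_t size;
    bool is_free;
};

using BlockTable = std::vector<BlockInfo, MallocAllocator<BlockInfo>>;
```

push_back on a BlockTable then grows through std::malloc, so it no longer depends on the memory manager it is supposed to describe.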

realloc() for NUMA Systems using HWLOC

I have a several custom allocators that provide different means to allocate memory based on different policies. One of them allocates memory on a defined NUMA node. The interface to the allocator is straight-forward
template<typename config>
class NumaNodeStrategy
{
public:
    static void *allocate(const size_t sz);
    static void *reallocate(void *old, size_t sz, size_t old_sz);
    static void deallocate(void *p, size_t sz);
};
The allocation itself is handled using the hwloc_alloc_membind_nodeset() methods, with the corresponding parameters set for allocation policies etc. However, hwloc only provides methods for allocating and freeing memory, and I was wondering how I should implement reallocate().
Two possible solutions:
Allocate new memory area and memcpy() the data
Use hwloc_set_membind_nodeset() to set the memory allocation / binding policy for the nodeset and use plain malloc() / posix_memalign() and realloc().
Can anyone help me in getting this right?
Update:
I try to make the question more specific: Is there a possibility to perform a realloc() using hwloc without allocating new memory and moving the pages around?

To reply to the edit:
There's no realloc in hwloc, and we currently have no plan to add one. If you see precisely what you want (the C prototype of the function), feel free to add a ticket to https://svn.open-mpi.org/trac/hwloc
To reply to ogsx: The memory binding isn't specific, it's virtual memory area specific, and possibly thread-specific. If you realloc, the libc doesn't do anything special.
1) If it can realloc within the same page, you get memory on the same node. Good, but rare, especially for large buffers.
2) If it reallocs into a different page (most cases for large buffers), it depends on whether the corresponding pages have already been allocated in physical memory by the malloc library in the past (malloc'ed and freed in virtual memory, but still allocated in physical memory).
2.a) If the virtual page has been allocated, it may have been allocated on another node for various reasons in the past, and you're screwed.
2.b) If the new virtual page has not been allocated yet, the default is to allocate it on the current node. If you specified a binding with set_area_membind() or mbind() earlier, it'll be allocated on the right node. You may be happy in this case.
In short, it depends on a lot of things. If you don't want to bother with the malloc lib doing complex/hidden internal things, and especially if your buffers are large, doing mmap(MAP_ANONYMOUS) instead of malloc is a simple way to be sure that pages are allocated when you really want them. And you even have mremap to do something similar to realloc.
alloc becomes mmap(length) + set_area_membind
realloc becomes mremap + set_area_membind (on the entire mremap'ed buffer)
Never used that but looks interesting.
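A Linux-only sketch of "alloc becomes mmap + set_area_membind, realloc becomes mremap + set_area_membind". The NUMA binding step is left as a comment (with hwloc it would be the area-membind call discussed in this thread, applied to the whole returned range) so that the sketch compiles without hwloc; `numa_alloc`, `numa_realloc` and `numa_free` are hypothetical names. mremap and MREMAP_MAYMOVE are Linux-specific; glibc exposes them with _GNU_SOURCE, which g++ defines by default.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>  // mmap, mremap (Linux), munmap

// Hypothetical: map fresh anonymous pages; nothing is backed by RAM until first touch.
void* numa_alloc(std::size_t len) {
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    // Bind [p, p + len) to the desired NUMA node(s) here, before the pages are touched.
    return p;
}

// Hypothetical: grow or shrink the mapping. If the kernel has to move it, it moves
// the existing pages to the new address instead of copying their contents.
void* numa_realloc(void* old, std::size_t old_len, std::size_t new_len) {
    void* p = mremap(old, old_len, new_len, MREMAP_MAYMOVE);
    if (p == MAP_FAILED) return nullptr;
    // Re-apply the binding to the entire remapped range [p, p + new_len).
    return p;
}

// Hypothetical: unmap the whole range.
void numa_free(void* p, std::size_t len) {
    int rc = munmap(p, len);
    assert(rc == 0);
    (void)rc;  // silence "unused variable" warnings when NDEBUG disables assert
}
```

Because the mapping is page-granular, this only makes sense for large buffers; small objects are better served by a per-node pool.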

The hwloc_set_area_membind_nodeset does the trick, doesn't it?
HWLOC_DECLSPEC int
hwloc_set_area_membind_nodeset (hwloc_topology_t topology,
const void *addr, size_t len, hwloc_const_nodeset_t nodeset,
hwloc_membind_policy_t policy, int flags)
Bind the already-allocated memory identified by (addr, len) to the NUMA node(s) in nodeset.
Returns:
-1 with errno set to ENOSYS if the action is not supported
-1 with errno set to EXDEV if the binding cannot be enforced
On Linux, this call is implemented via mbind. It works only if the pages in the area were not touched, so it is just a more correct way to move the memory region in your second solution. UPDATE: there are MPOL_MF_MOVE* flags to move touched data.
The only syscall to move pages without reallocate-and-copy I know is move_pages
move_pages moves a set of pages in the address space of a running process to a different NUMA node.

You're wrong. mbind can move pages that have been touched. You just need to add MPOL_MF_MOVE. That's what hwloc_set_area_membind_nodeset() does if you add the flag HWLOC_MEMBIND_MIGRATE.
move_pages is just a different way to do it (more flexible but a bit slower, because you can move independent pages to different places). Both mbind with MPOL_MF_MOVE and move_pages (and migrate_pages) end up using the exact same migrate_pages() function in mm/migrate.c once they have converted the input into a list of pages.

C++ Using the new operator efficiently

When instantiating a class with new, what kinds of benefits would we gain by reusing the objects instead of deleting the memory?
What is the process of new? Does a context switch occur? New memory is allocated; who is doing the allocation? The OS?

You've asked a few questions here...
Instead of deleting the memory what kinds of benefits would we gain based on the reuse of the objects?
That depends entirely on your application. Even supposing I knew what the application is, you've left another detail unspecified -- what is the strategy behind your re-use? But even knowing that, it's very hard to predict or answer generically. Try some things and measure them.
As a rule of thumb I like to minimize the most gratuitous of allocations. This is mostly premature optimization, though. It'd only make a difference over thousands of calls.
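One possible re-use strategy, assumed here since the question doesn't specify one: keep deleted objects' memory on a per-class free list and hand it back on the next `new`, so the general-purpose allocator is only hit when the list is empty. `Particle` and its `reused` counter are hypothetical; the sketch is single-threaded, never returns memory, and whether it helps at all is exactly the kind of thing to measure.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical class that recycles its own memory through a free list.
class Particle {
public:
    float x = 0, y = 0;
    static int reused;  // how many allocations were served from the free list

    static void* operator new(std::size_t n) {
        if (n == sizeof(Particle) && free_list_) {  // reuse memory from an earlier delete
            Slot* s = free_list_;
            free_list_ = s->next;
            ++reused;
            return s;
        }
        return ::operator new(n);                   // empty list (or derived class): normal path
    }

    static void operator delete(void* p, std::size_t n) noexcept {
        if (!p) return;
        if (n != sizeof(Particle)) { ::operator delete(p); return; }
        free_list_ = new (p) Slot{free_list_};      // keep the memory instead of freeing it
    }

private:
    struct Slot { Slot* next; };
    static Slot* free_list_;
};

static_assert(sizeof(Particle) >= sizeof(void*), "a freed Particle must be able to hold a link");

int Particle::reused = 0;
Particle::Slot* Particle::free_list_ = nullptr;
```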
What is the process of new?
Entirely implementation dependent. But the general strategy that allocators use is to have a free list, that is, a list of blocks which have been freed in the process. When the free list is empty or contains insufficient contiguous free space, it must ask the kernel for the memory, which it can only give out in blocks of a constant page size. (4096 on x86.) An allocator also has to decide when to chop up, pad, or coalesce blocks. Multi-threading can also put pressure on allocators because they must synchronize their free lists.
Generally it's a pretty expensive operation. Maybe not so much relative to what else you're doing. But it ain't cheap.
Does a context switch occur?
Entirely possible. It's also possible that it won't. Your OS is free to do a context switch any time it gets an interrupt or a syscall, so uh... That can happen at a lot of times; I don't see any special relationship between this and your allocator.
New memory is allocated, who is doing the allocation? OS?
It might come from a free list, in which case there is no system call involved, hence no help from the OS. But it might come from the OS if the free list can't satisfy the request. Also, even if it comes from the free list, your kernel might have paged out that data, so you could get a page fault on access and the kernel's allocator would kick in. So I guess it'd be a mixed bag. Of course, you can have a conforming implementation that does all kinds of crazy things.

new allocates memory for the object on the heap and calls the constructor.
Context switches do not have to occur.
The C++ runtime allocates the memory on its free store using whatever mechanism it deems fit.
Usually the C++ runtime allocates large blocks of memory using OS memory-management functions and then subdivides them with its own heap implementation. The Microsoft C++ runtime mostly uses the Win32 heap functions, which are implemented in user mode and divide up OS memory allocated through the virtual memory APIs. There are thus no context switches unless its current allocation of virtual memory is exhausted and it needs to go to the OS to allocate more.
There is a theoretical problem when allocating memory: there is no upper bound on how long a heap traversal might take to find a free block. Practically, though, heap allocations are usually fast.
The exception is threaded applications. Because most C++ runtimes share a single heap between multiple threads, access to the heap needs to be serialized. This can severely degrade the performance of certain classes of applications that rely on multiple threads being able to new and delete many objects.
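A sketch of one common mitigation, much simpler than what production allocators do: each thread keeps a small private cache of fixed-size blocks and takes the shared lock only when that cache runs empty or grows too large. `FixedBlockPool`, the 64-byte block size and the batch of 32 are assumptions for illustration; blocks are never returned to the OS, and blocks still cached when a thread exits are leaked.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <new>

// Hypothetical pool: per-thread cache in front of a shared, mutex-protected free list.
class FixedBlockPool {
public:
    static constexpr std::size_t kBlock = 64;  // every block has this size
    static constexpr int kBatch = 32;          // blocks moved per lock acquisition

    static void* allocate() {
        Cache& c = cache();
        if (!c.head) refill(c);                // slow path: takes the shared lock
        assert(c.head != nullptr);
        Node* n = c.head;
        c.head = n->next;
        --c.count;
        return n;
    }

    static void deallocate(void* p) {
        Cache& c = cache();
        c.head = new (p) Node{c.head};         // fast path: no lock
        if (++c.count > 2 * kBatch) flush(c);  // hand the surplus to other threads
    }

private:
    struct Node { Node* next; };
    struct Cache { Node* head = nullptr; int count = 0; };

    static Cache& cache() { thread_local Cache c; return c; }
    static std::mutex& lock() { static std::mutex m; return m; }
    static Node*& shared() { static Node* head = nullptr; return head; }

    static void refill(Cache& c) {
        std::lock_guard<std::mutex> guard(lock());
        for (int i = 0; i < kBatch; ++i) {
            Node* n = shared();
            if (n) {
                shared() = n->next;            // recycle a block another thread flushed
            } else {
                n = new (::operator new(kBlock)) Node{nullptr};  // fresh block, never returned
            }
            n->next = c.head;
            c.head = n;
            ++c.count;
        }
    }

    static void flush(Cache& c) {
        std::lock_guard<std::mutex> guard(lock());
        for (int i = 0; i < kBatch && c.head; ++i) {
            Node* n = c.head;
            c.head = n->next;
            --c.count;
            n->next = shared();
            shared() = n;
        }
    }
};
```

The lock is taken once per batch of 32 blocks instead of once per allocation, which is the contention the paragraph above describes.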

If you new or delete an address, it's marked as occupied or unassigned. The implementations do not talk to the kernel all the time: bigger chunks of memory are reserved and divided into smaller chunks in user space within your application.
Because new and delete are re-entrant (or thread-safe, depending on the implementation), a context switch may occur, but your program is thread-safe anyway while using the default new and delete.
In C++ you are able to overload the new and delete operators, e.g. to plug in your own memory management:
#include <cstdlib> // declarations of malloc and free
#include <new>     // std::bad_alloc
using namespace std;

class C {
public:
    C() {}
    void* operator new (size_t size); // implicitly declared as a static member function
    void operator delete (void *p);   // implicitly declared as a static member function
};

void* C::operator new (size_t size) {
    void * p = malloc(size);
    if (p == 0) throw bad_alloc(); // report failure the way the standard operator new does
    return p;
}

void C::operator delete (void *p) {
    free(p); // the memory came from malloc, so it goes back to free
}

int main() {
    C *p = new C; // calls C::operator new, then the constructor
    delete p;     // calls the destructor, then C::operator delete
}
