Efficiently allocating many short-lived small objects - c++

I've got a small class (16 bytes on a 32bit system) which I need to dynamically allocate. In most cases the life-time of any given instance is very short. Some instances may also be passed across thread boundaries.
Having done some profiling, I found that my program appears to be spending more time allocating and deallocating the things than it's actually spending using them, so I want to replace the default new and delete with something a little more efficient.
For a large object (db connections as it happens, which are expensive to construct rather than allocate), I'm already using a pooling system, however that involves a list for storing the "free" objects, and also a mutex for thread safety. Between the mutex and the list it actually performs worse than with the basic new/delete for the small objects.
I found a number of small object allocators on Google, however they seem to be using a global/static pool which is not used in a thread safe manner, making them unsuitable for my use :(
What other options have I got for efficient memory management of such small objects?

Maybe try using Google's tcmalloc? It is optimized for fast allocation/deallocation in a threaded program, and has low overhead for small objects.

Some instances may also be passed across thread boundaries
Only "some"? So perhaps you can afford to pay extra for these, if it makes the ones that don't get passed to other threads cheaper. There are various ways I can think of to get to one allocator per thread and avoid the need to lock when allocating or freeing in the thread to which the allocator belongs. I don't know which might be possible in your program:
Copy things across the thread boundary, instead of passing them.
Arrange that if they're passed to another thread for any reason, then they're passed back to the original thread to free. This doesn't necessarily have to happen very often, you could queue up a few in the receiving thread and pass them all back in a message later. This assumes of course that the thread which owns the allocator is going to stick around.
Have two free lists per allocator, one synchronised (to which objects are added when they're freed from another thread), and one unsynchronised. Only if the unsynchronised list is empty, and you're allocating (and hence in the thread which owns the allocator), do you need to lock the synchronised free list and move all of its current contents to the unsynchronised list. If objects being passed to other threads is rare, this basically eliminates contention on the mutex and massively reduces the number of times it's taken at all.
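A minimal sketch of that last idea, assuming a fixed block size and made-up names like FixedPool; the owning thread is the only one that allocates, and other threads only touch the mutex-protected inbox:
#include <cstddef>
#include <cstdlib>
#include <mutex>

// One pool per thread; only the owning thread calls allocate()/freeLocal().
class FixedPool {
    struct Node { Node* next; };
    Node* local = nullptr;                // unsynchronised free list (owner only)
    Node* inbox = nullptr;                // cells freed from other threads
    std::mutex inboxMutex;
    static const std::size_t BLOCK = 16;  // size of the small object

public:
    void* allocate() {                    // owning thread only
        if (!local) {                     // drain the inbox only when we run dry
            std::lock_guard<std::mutex> g(inboxMutex);
            local = inbox;
            inbox = nullptr;
        }
        if (!local)
            return std::malloc(BLOCK);    // fall back to the system allocator
        Node* n = local;
        local = n->next;
        return n;
    }
    void freeLocal(void* p) {             // free from the owning thread: no lock
        Node* n = static_cast<Node*>(p);
        n->next = local;
        local = n;
    }
    void freeRemote(void* p) {            // free from any other thread
        std::lock_guard<std::mutex> g(inboxMutex);
        Node* n = static_cast<Node*>(p);
        n->next = inbox;
        inbox = n;
    }
};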
If all the above fails, having one allocator per thread might still allow you to get rid of the mutex and use a lock-free queue for the free list (multiple writers freeing, single reader allocating), which could improve performance a bit. Implementing a lock-free queue is platform-specific.
Taking a step further back, does your app frequently hit a state in which you know that all cells allocated after a certain point (perhaps a little in the past), are no longer in use? If so, and assuming the destructor of your small objects doesn't do anything terribly urgent, then don't bother freeing cells at all - at the "certain point" create a new allocator and mark the old one as no longer in use for new allocations. When you "hit the state", free the whole allocator and its underlying buffer. If the "certain point" and the "state" are simultaneous, all the easier - just reset your allocator.
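A rough illustration of that last approach, assuming the objects are trivially destructible and using a made-up Arena type; nothing is freed individually, and the whole buffer is recycled (or released) at the "certain point":
#include <cstddef>
#include <cstdlib>

// Bump allocator: hand out cells sequentially, never free them one by one.
// (Alignment handling is omitted for brevity.)
class Arena {
    char* buffer;
    std::size_t capacity;
    std::size_t used = 0;
public:
    explicit Arena(std::size_t bytes)
        : buffer(static_cast<char*>(std::malloc(bytes))), capacity(bytes) {}
    ~Arena() { std::free(buffer); }

    void* allocate(std::size_t n) {
        if (used + n > capacity) return nullptr;  // caller starts a fresh arena
        void* p = buffer + used;
        used += n;
        return p;
    }
    void reset() { used = 0; }  // the "certain point": everything is dead, reuse the buffer
};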

You might make sure that you are using the low fragmentation heap. It is on by default in Vista and later, but I do not think that is so with earlier OS's. That can make a big difference in allocation speed for small objects.
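If I remember correctly, on XP/Server 2003 you have to opt in explicitly; roughly like this (error handling omitted, see the MSDN docs for HeapSetInformation):
#include <windows.h>

// Ask for the low-fragmentation heap on the process default heap
// (only needed pre-Vista; later versions enable it automatically).
ULONG lfh = 2;  // 2 == enable the low-fragmentation heap
HeapSetInformation(GetProcessHeap(), HeapCompatibilityInformation,
                   &lfh, sizeof(lfh));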

Related

Is access to the heap serialized?

One rule every programmer quickly learns about multithreading is:
If more than one thread has access to a data structure, and at least one of the threads might modify that data structure, then you'd better serialize all accesses to that data structure, or you're in for a world of debugging pain.
Typically this serialization is done via a mutex -- i.e. a thread that wants to read or write the data structure locks the mutex, does whatever it needs to do, and then unlocks the mutex to make it available again to other threads.
Which brings me to the point: the memory-heap of a process is a data structure which is accessible by multiple threads. Does this mean that every call to default/non-overloaded new and delete is serialized by a process-global mutex, and is therefore a potential serialization-bottleneck that can slow down multithreaded programs? Or do modern heap implementations avoid or mitigate that problem somehow, and if so, how do they do it?
(Note: I'm tagging this question linux, to avoid the correct-but-uninformative "it's implementation-dependent" response, but I'd also be interested in hearing about how Windows and MacOS/X do it as well, if there are significant differences across implementations)
new and delete are thread safe
The following functions are required to be thread-safe:
The library versions of operator new and operator delete
User replacement versions of global operator new and operator delete
std::calloc, std::malloc, std::realloc, std::aligned_alloc, std::free
Calls to these functions that allocate or deallocate a particular unit of storage occur in a single total order, and each such deallocation call happens-before the next allocation (if any) in this order.
With gcc, new is implemented by delegating to malloc, and we see that their malloc does indeed use a lock. If you are worried about your allocation causing bottlenecks, write your own allocator.
The answer is yes, but in practice it is usually not a problem.
If it is a problem for you, you may try replacing your implementation of malloc with tcmalloc, which reduces, but does not eliminate, possible contention (since there is still only one heap that needs to be shared among the threads of the process).
TCMalloc assigns each thread a thread-local cache. Small allocations are satisfied from the thread-local cache. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections are used to migrate memory back from a thread-local cache into the central data structures.
There are also other options like using custom allocators and/or specialized containers and/or redesigning your application.
You tried to head off the answer "it's architecture/system dependent", but serializing accesses from multiple threads is only strictly necessary when the heap grows or shrinks, i.e. when the program needs to expand it or return part of it to the system.
So the first answer has to be simply "it's implementation dependent", without any real system dependency, because normally libraries obtain large chunks of memory for the heap and administer them internally, which makes the problem effectively independent of the operating system and architecture.
The second answer is that, of course, if you have only a single heap for all the threads, you have a possible bottleneck whenever all of the active threads compete for a single chunk of memory. There are several approaches to this: you can have a pool of heaps to allow parallelism, and have different threads use different pools for their requests. Note that the biggest potential problem is in requesting memory, as that is where the bottleneck arises; on returning memory there is no such issue, because you can act more like a garbage collector, queueing the returned chunks and having a dedicated thread dispatch them to their proper places while preserving the integrity of each heap. Having multiple heaps even lets you classify them by priority, by chunk size, etc., so the risk of collision is kept low by the class of problem you are dealing with.
This is the case for operating system kernels like *BSD, which use several memory heaps, each somewhat dedicated to the kind of use it will receive (there's one for the io-disk buffers, one for virtual memory mapped segments, one for process virtual memory space management, etc.).
I recommend reading The Design and Implementation of the FreeBSD Operating System, which explains very well the approach used in the kernel of BSD systems. It is general enough, and probably a large percentage of other systems follow this or a very similar approach.

cannot allocate memory fast enough?

Assume you are tasked to address a performance bottleneck in an application. Via profiling we discover the bottleneck is related to memory allocation. We discover that the application can only perform N memory allocations per second, no matter how many threads we have allocating memory. Why would we be seeing this behavior and how might we increase the rate at which the application can allocate memory. (Assume that we cannot change the size of the memory blocks that we are allocating. Assume that we cannot reduce the use of dynamically allocated memory.)
Okay, a few solutions exist - however almost all of them seem to be excluded via some constraint or another.
1. Have more threads allocate memory
We discover that the application can only perform N memory allocations per second, no matter how many threads we have allocating memory.
From this, we can cross-off any ideas of adding more threads (since "no matter how many threads"...).
2. Allocate more memory at a time
Assume that we cannot change the size of the memory blocks that we are allocating.
Fairly obviously, we have to allocate the same block size.
3. Use (some) static memory
Assume that we cannot reduce the use of dynamically allocated memory.
This one I found most interesting. It reminded me of a story I heard about a FORTRAN programmer (before Fortran had dynamic memory allocation) who just used a HUGE statically allocated array as a private heap.
Unfortunately, this constraint prevents us from using such a trick. However, it does give a glimpse into one aspect of a (the) solution.
My Solution
At the start of execution (either of the program, or on a per-thread basis) make several* memory allocation calls. Then use the memory from these later in the program (along with the existing dynamic memory allocations).
* Note: The 'several' would probably be an exact number, determined from the profiling which the question mentions in the beginning.
TL;DR
The trick is to modify the timing of the memory allocations.
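A minimal sketch of this idea, assuming a fixed block size; the names (prewarmedBlocks, BLOCK_SIZE, PREALLOC_COUNT) are made up for illustration:
#include <cstddef>
#include <cstdlib>
#include <vector>

const std::size_t BLOCK_SIZE = 4096;      // the block size we are stuck with
const std::size_t PREALLOC_COUNT = 1000;  // the "several", tuned from profiling

std::vector<void*> prewarmedBlocks;

void prewarm() {                          // called once at program/thread start
    prewarmedBlocks.reserve(PREALLOC_COUNT);
    for (std::size_t i = 0; i < PREALLOC_COUNT; ++i)
        prewarmedBlocks.push_back(std::malloc(BLOCK_SIZE));
}

void* takeBlock() {                       // used later in place of a fresh allocation
    if (prewarmedBlocks.empty())
        return std::malloc(BLOCK_SIZE);   // fall back once the stock runs out
    void* p = prewarmedBlocks.back();
    prewarmedBlocks.pop_back();
    return p;
}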
Looks like a challenging problem, though without details you can only make some guesses. (Which is most likely the idea of this question.)
The limitation here is the number of allocations, not the size of the allocation.
If we can assume that you are in control of where the allocations occur, you can allocate the memory for multiple instances at once. Please consider the code below a sketch, as it's only for illustration purposes.
#include <cstdlib>
#include <new>

const static size_t NR_COMBINED_ALLOCATIONS = 16;
// One allocation provides raw storage for many instances.
void* memoryBuffer = std::malloc(sizeof(MyClass) * NR_COMBINED_ALLOCATIONS);
size_t nextIndex = 0;
// Some looping code
// Construct in place with placement new instead of allocating again.
MyClass* myNewClass = new (static_cast<char*>(memoryBuffer) + nextIndex++ * sizeof(MyClass)) MyClass;
// Some code
myNewClass->~MyClass();   // destroy explicitly; the storage is released below
std::free(memoryBuffer);  // one free returns the storage of all instances at once
Your code will most likely become a lot more complex, though you will most likely tackle this bottleneck. In case you have to return this new class, you will need even more code just to do memory management.
Given this information, you can write your own implementation of allocators for your STL, override the 'new' and 'delete' operators ...
If that would not be enough, try challenging the limitations. Why can you only do a fixed number of allocations, is this because of unique locking? If so, can we improve this? Why do you need that many allocations, would changing the algorithm that is being used fix this issue ...
... the application can only perform N memory allocations per second, no matter how many threads we have allocating memory. Why would we be seeing this behavior and how might we increase the rate at which the application can allocate memory.
IMHO, the most likely cause is that the allocations are coming from a common system pool.
Because they share a pool, each thread has to gain access through some critical-section blocking mechanism (perhaps a semaphore).
The more threads compete for dynamic memory (i.e. use new), the more critical-section blocking occurs.
The context switch between tasks is the time waste here.
How increase the rate?
option 1 - serialize the usage ... and this means, of course, that you cannot simply try to use a semaphore at another level. For one system I worked on, high dynamic memory utilization happened during system start-up. In that case, it was easiest to change the start-up such that thread n+1 (of this collection) only started after thread n had completed its initialization and fallen into its wait-for-input loop. With only 1 thread doing its start-up work at a time (and very few other dynamic memory users yet running), no critical section blockage occurred. 4 simultaneous start-ups would take 30 seconds; 4 serialized start-ups finished in 5 seconds.
option 2 - provide a pool of ram and a private new/delete for each particular thread. If only one thread accesses a pool at a time, a critical section or semaphore is not needed. In an embedded system, the challenge here is to allocate a reasonable amount of private pool for each thread without too much waste. On a desktop with multiple gigabytes of ram, this is probably less of a problem.
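A sketch of option 2 using C++11 thread_local and a made-up SmallObject type; each thread recycles its own blocks, so the fast path takes no lock (it assumes objects are deleted on the thread that allocated them):
#include <cstddef>
#include <cstdlib>

struct SmallObject {
    char payload[16];

    // Per-thread free list: each thread only ever touches its own list.
    struct FreeNode { FreeNode* next; };
    static thread_local FreeNode* freeList;

    void* operator new(std::size_t size) {
        if (FreeNode* n = freeList) {     // reuse a block we freed earlier
            freeList = n->next;
            return n;
        }
        return std::malloc(size);         // cold path: go to the shared heap
    }
    void operator delete(void* p) {
        FreeNode* n = static_cast<FreeNode*>(p);
        n->next = freeList;               // assumes delete happens on the allocating thread
        freeList = n;
    }
};
thread_local SmallObject::FreeNode* SmallObject::freeList = nullptr;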
I believe you could use a separate thread which is responsible for memory allocation. This thread would have a queue of requests, each pairing a thread identifier with the amount of memory needed. Threads would not directly allocate memory, but rather send an allocation request to the queue and go into a wait state. The allocator thread, in turn, would try to process each memory allocation request from the queue and wake the corresponding sleeping thread up. When the thread responsible for memory handling cannot process an allocation due to a limitation, it should wait until memory can be allocated again.
One could build another layer into the solution as #Tersosauros's solution suggested to slightly optimize speed, but it should be based on something like the idea above nonetheless.

Multithread Memory Profiling in C++

I am working on profiling the memory usage of multiple threads in my application. I would like to be able to track the maximum allocation/current allocation of any given thread that is running. In order to so, I planned on interposing on mallocs/frees. During each call to malloc, I would update the allocation records for the particular thread in a static map that associated thread ids to their particular metadata record. I am currently having issues during process exit. I think the issue is that when all the destructors are called for cleanup, the static map and lock protecting it have to be destroyed. My interposed mallocs/frees, however, acquire the lock before updating the profiling metadata structures. Eventually, the lock is destroyed, but there are subsequent calls to malloc/free that result in an attempt to acquire the no longer existent lock resulting in a segfault.
Another issue that I am concerned about is that there are internal calls to malloc generated within my interposed malloc to allocate entries in the map.
Any ideas on ways of approaching the problem of profiling memory usage on a per thread basis? Any suggestions on data structures to track the usage of each thread? Does the above approach seem reasonable or are there any other ways of approaching the problem?
If you store your "extra" data as part of the allocation itself (before is easier, but you could do it after too - just need a size somewhere), then you shouldn't need any locks at all. Just a tad more memory. Of course, you will need to use atomics to update any lists of items.
If you look at this answer:
Setting memory on a custom heap
and imagine that HeapAlloc and HeapFree are malloc and free respectively. Then add code to store which thread is being used for the allocation.
So, instead of using a map, you simply update a linked list (using atomics to prevent multiple updates). This does of course make it a little more difficult to produce up-to-date measurements per thread; you'll have to scan the list of allocations.
Of course, this only works for DIRECT calls to malloc and free.
The same principle would be possible by "injecting" a replacement malloc/free function (built along the principles in the other post, but of course not using the original malloc to allocate the memory, and not using free to free the memory).
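A rough sketch of the header idea, assuming you can route allocations through wrapper functions (trackedMalloc/trackedFree and the per-thread counter are invented names, and it assumes each thread outlives its allocations):
#include <atomic>
#include <cstddef>
#include <cstdlib>

// Prefix every allocation with its size and a pointer to the allocating
// thread's counter, so no global map or lock is needed to attribute it later.
struct AllocHeader {
    std::size_t size;
    std::atomic<std::size_t>* threadBytes;
};

thread_local std::atomic<std::size_t> bytesInUse{0};

void* trackedMalloc(std::size_t size) {
    AllocHeader* h = static_cast<AllocHeader*>(std::malloc(sizeof(AllocHeader) + size));
    if (!h) return nullptr;
    h->size = size;
    h->threadBytes = &bytesInUse;
    bytesInUse.fetch_add(size, std::memory_order_relaxed);
    return h + 1;                         // the caller sees only the memory past the header
}

void trackedFree(void* p) {
    if (!p) return;
    AllocHeader* h = static_cast<AllocHeader*>(p) - 1;
    h->threadBytes->fetch_sub(h->size, std::memory_order_relaxed);  // safe from any thread
    std::free(h);
}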
This is a complicated thing to do and make work for all cases. There are many issues that you'll miss and only ever find through trial and error. I should know, I've been responsible for building a tool that does what you are trying to do. We've been doing this since 1999, available commercially since 2002.
If you are using Windows, C++ Memory Validator can give you per-thread profiling statistics.
http://www.softwareverify.com/cpp-memory.php.
The Objects tab and Sizes tab both have Threads sub-tabs which allow you to view data per thread. You can also run advanced queries on the Analysis tab that will allow you to view data on a per-thread basis.
Spend your time on your job, not writing tools.

Can I allocate memory faster by using multiple threads?

If I make a loop that allocates 1024-element integer arrays, int[1024], and I want it to allocate 10000 arrays, can I make it faster by running the memory allocations from multiple threads?
I want them to be in the heap.
Let's assume that I have a multi-core processor for the job.
I already did try this, but it decreased the performance. I'm just wondering, did I just make bad code or is there something that i didn't know about memory allocation?
Does the answer depend on the OS? Please tell me how it works on different platforms if so.
Edit:
The integer array allocation loop was just a simplified example. Don't bother telling me how I can improve that.
It depends on many things, but primarily:
the OS
the implementation of malloc you are using
The OS is responsible for allocating the "virtual memory" that your process has access to and builds a translation table that maps the virtual memory back to actual memory addresses.
Now, the default implementation of malloc is generally conservative, and will simply have a giant lock around all this. This means that requests are processed serially, and the only thing that allocating from multiple threads instead of one achieves is slowing the whole thing down.
There are more clever allocation schemes, generally based upon pools, and they can be found in other malloc implementations: tcmalloc (from Google) and jemalloc (used by Facebook) are two such implementations designed for high-performance in multi-threaded applications.
There is no silver bullet though, and at one point the OS must perform the virtual <=> real translation which requires some form of locking.
Your best bet is to allocate by arenas:
Allocate big chunks (arenas) at once
Split them up in arrays of the appropriate size
There is no need to parallelize the arena allocation, and you'll be better off asking for the biggest arenas you can (bear in mind that requests for too large an amount may fail); then you can parallelize the split.
tcmalloc and jemalloc may help a bit, however they are not designed for big allocations (which is unusual) and I do not know if it is possible to configure the size of the arenas they request.
The answer depends on the memory allocation routines, which are a combination of a C++ library layer (operator new), probably wrapped around libc's malloc(), which in turn occasionally calls an OS function such as sbrk(). The implementation and performance characteristics of all of these are unspecified, and may vary from compiler version to version, with compiler flags, different OS versions, different OSes, etc. If profiling shows it's slower, then that's the bottom line. You can try varying the number of threads, but what's probably happening is that the threads are all trying to obtain the same lock in order to modify the heap... the overheads involved with saying "ok, thread X gets the go-ahead next" and "thread X here, I'm done" are simply wasting time. Another C++ environment might end up using atomic operations to avoid locking, which might or might not prove faster... there is no general rule.
If you want to complete faster, consider allocating one array of 10000*1024 ints, then using different parts of it (e.g. [0]..[1023], [1024]..[2047]...).
I think that perhaps you need to adjust your expectation from multi-threading.
The main advantage of multi-threading is that you can do tasks asynchronously, i.e. in parallel. In your case, when your main thread needs more memory it does not matter whether it is allocated by another thread - you still need to stop and wait for allocation to be accomplished, so there is no parallelism here. In addition, there is an overhead of a thread signaling when it is done and the other waiting for completion, which just can degrade the performance. Also, if you start a thread each time you need allocation this is a huge overhead. If not, you need some mechanism to pass the allocation request and response between threads, a kind of task queue which again is an overhead without gain.
Another approach could be that the allocating thread runs ahead and pre-allocates the memory that you will need. This can give you a real gain, but if you are doing pre-allocation, you might as well do it in the main thread which will be simpler. E.g. allocate 10M in one shot (or 10 times 1M, or as much contiguous memory as you can have) and have an array of 10,000 pointers pointing to it at 1024 offsets, representing your arrays. If you don't need to deallocate them independently of one another this seems to be much simpler and could be even more efficient than using multi-threading.
As for glibc, it has arenas (see here), each with its own lock.
You may also consider tcmalloc by Google (it stands for Thread-Caching malloc), which shows a 30% performance boost for threaded applications. We use it in our project. In debug mode it can even discover some incorrect usage of memory (e.g. new/free mismatch).
As far as I know, all OSes have an implicit mutex lock inside the dynamic allocation calls (malloc...). If you think about it for a moment, not locking that operation would lead to terrible problems.
You could use the multithreading API Threading Building Blocks (http://threadingbuildingblocks.org/), which has a multithreading-friendly scalable allocator.
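For example, its scalable_allocator can be dropped into standard containers (assuming TBB is installed and you link against its tbbmalloc library); a sketch:
#include <vector>
#include <tbb/scalable_allocator.h>

int main() {
    // Each 1024-int array now comes from TBB's per-thread pools
    // instead of the globally locked default heap.
    std::vector<std::vector<int, tbb::scalable_allocator<int>>> arrays;
    arrays.reserve(10000);
    for (int i = 0; i < 10000; ++i)
        arrays.emplace_back(1024);
    return 0;
}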
But I think a better idea would be to allocate the whole memory once (that should work quite fast) and split it up on your own. I think the tbb allocator does something similar.
Do something like
new int[1024*10000] and then assign the parts of 1024 ints to your pointer array or whatever you use.
Do you understand?
Because the heap is shared per process, it will be locked for each allocation, so it can only be accessed serially by each thread. This could explain the decrease in performance when you allocate from multiple threads like you are doing.
If the arrays belong together and will only be freed as a whole, you can just allocate an array of 10000*1024 ints, and then make your individual arrays point into it. Just remember that you cannot delete the small arrays, only the whole.
int *all_arrays = new int[1024 * 10000];
int *small_array123 = all_arrays + 1024 * 123;
Like this, you get the small arrays by replacing the 123 with any number between 0 and 9999.
The answer depends on the operating system and runtime used, but in most cases, you cannot.
Generally, you will have two versions of the runtime: a multi-threaded version and a single-threaded version.
The single-threaded version is not thread-safe. Allocations made by two threads at the same time can blow your application up.
The multi-threaded version is thread-safe. However, as far as allocations go on most common implementations, this just means that calls to malloc are wrapped in a mutex. Only one thread can ever be in the malloc function at any given time, so attempting to speed up allocations with multiple threads will just result in a lock convoy.
It may be possible that there are operating systems that can safely handle parallel allocations within the same process, using minimal locking, which would allow you to decrease time spent allocating. Unfortunately, I don't know of any.

Slow down associated with vector::push_back and placement new in a multithreaded application

I have a multithreaded application in which my thread utilization is very poor (in the ball park of 1%-4% per thread, with fewer threads than processors). In the debugger, it appears to be spending a lot of time in vector::push_back, specifically the placement new that occurs during the push_back. I've tried using reserve to avoid having the vector expand its capacity and copy everything, but that doesn't appear to be the problem. Commenting out the vector::push_backs leads to much better thread utilization.
This problem is occurring with vectors of uint64_t, so it does not appear to be the result of complicated object construction. I have tried using both the standard allocator and a custom allocator and both perform the same way. The vectors are being used by the same thread that allocated them.
Unless you need these initialized to 0, consider writing a vector-like class which does not initialize. I've found this to provide measurable performance gains in some scenarios.
Side note: When your profiler claims you're spending most your time with primitive operations on 64-bit integers, you know the rest of your code is optimized decently.
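One common way to get that effect without writing a whole container is an allocator whose construct() does nothing on default construction; a sketch (the name default_init_allocator is made up, and it needs C++11):
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Wraps another allocator but skips zero-initialisation on default
// construction, so filling in elements later no longer pays for a memset.
template <typename T, typename A = std::allocator<T>>
struct default_init_allocator : A {
    template <typename U>
    struct rebind {
        using other = default_init_allocator<
            U, typename std::allocator_traits<A>::template rebind_alloc<U>>;
    };
    using A::A;

    template <typename U>
    void construct(U* p) {
        ::new (static_cast<void*>(p)) U;  // default-init: uninitialised for uint64_t
    }
    template <typename U, typename... Args>
    void construct(U* p, Args&&... args) {
        std::allocator_traits<A>::construct(static_cast<A&>(*this), p,
                                            std::forward<Args>(args)...);
    }
};

std::vector<std::uint64_t, default_init_allocator<std::uint64_t>> values;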
Maybe something trivial that won't really work, but as the push_back calls create a new item, why not initialize the vector to all 0's and access the elements with something like at or operator[]? That should get rid of any lock on the vector.
Does the thread utilization improve if you only use one thread? If so, perhaps you are running afoul of some sort of heap lock, e.g.:
In multithreaded C/C++, does malloc/new lock the heap when allocating memory
http://msdn.microsoft.com/en-us/library/ms810466.aspx