Dynamic Lock-free memory allocators - c++

One of the difficulties in writing algorithms or data structures that satisfy lock-free progress guarantees is dynamic memory allocation: calling something like malloc or new isn't guaranteed to be lock-free in a portable manner. However, many lock-free implementations of malloc or new exist, and there are also a variety of lock-free memory allocators that can be used to implement lock-free algorithms/data-structures.
However, I still don't understand how this can actually completely satisfy lock-free progress guarantees, unless you specifically limit your data-structure or algorithm to some pre-allocated static memory pool. But if you need dynamic memory allocation, I don't understand how any alleged lock-free memory allocator can ever be truly lock-free in the long-run. The problem is that no matter how amazing your lock-free malloc or new might be, ultimately you may run out of memory, at which point you have to fall back on asking the operating system for more memory. That means that ultimately you have to call brk() or mmap() or some such low-level equivalent to actually get access to more memory. And there is simply no guarantee that any of these low-level calls are implemented in a lock-free manner.
There's simply no way around this (unless you're using an ancient OS like MS-DOS that doesn't provide memory protection, or you write your own completely lock-free operating system - two scenarios that are neither practical nor likely). So, how can any dynamic memory allocator truly be lock-free?

As you have found yourself, the fundamental OS allocator is most likely not lock-free, because it has to deal with multiple processes and all manner of interesting stuff that makes it really hard to not introduce some sort of lock.
For some cases, however, the "lock free memory allocation" doesn't mean "never locks", but "statistically locks so rarely that it doesn't really matter". Which is fine for anything but the most strict real-time systems. If you don't have high contention on your lock, then lock or no lock doesn't really matter - the purpose of lock-free is not really the overhead of the lock itself, but the ease with which it becomes a bottle-neck where every thread or process in the system has to pass through this one place to do anything useful, and as it does so, it has to wait in queue [it may not be a true queue either, it may be "whoever wakes first" or some other mechanism that decides who comes out next, after the current caller].
There are a few different options to solve this:
If you have a memory pool with a finite size, you can ask the OS for all that memory at once, when the software is being started. Once the memory has been obtained from the OS, it can be used as a lock-free pool. The obvious drawback is that there is a limit to how much memory can be allocated. When the pool is exhausted you have to stop allocating: either fail the application altogether, or fail that particular operation.
Of course, in a system like Linux or Windows, there is still no guarantee that memory allocation, in a lock-free scenario, means "instant access to the allocated memory", since the system can and will allocate memory without actual physical memory backing to it, and only once the memory is actually being used, the physical memory page is assigned to it. This may both involve locks and for example disk-I/O to page out some other page to the swap.
For such strict real-time systems that the time of a single system call that may contend for a lock is "too much", the solution is of course to use a dedicated OS, one that has a lock-free allocator inside the OS (or at least one that has a known real-time behaviour that is acceptable - it locks for at most X microseconds [X can be less than 1.0]). Real-time systems often have a pool of memory and fixed-size buckets for recycling old allocations, which can be done in a lock-free manner - the buckets are a linked list, so you can insert/remove from that list with atomic compare&exchange operations [possibly with retry, so although it's technically lock-free, it's not zero wait time in contended situations].
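To make the "linked list of buckets plus compare&exchange" idea concrete, here is a minimal sketch of a fixed-size pool whose free list is a stack of block indices. The names (LockFreePool, pool_demo) are made up for illustration. It assumes all memory is reserved up front, that a 64-bit std::atomic is lock-free on the target, and it packs a 32-bit counter next to the head index so that a block which is popped and pushed back between a thread's load and its compare&exchange is detected (the ABA problem). It is not how any particular RTOS does it.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Fixed-capacity pool of equally sized blocks. All memory is obtained once,
// in the constructor; allocate() and release() only use atomic
// compare-and-exchange on a free list of block indices.
class LockFreePool {
public:
    LockFreePool(std::size_t block_size, std::uint32_t block_count)
        : block_size_(block_size),
          storage_(block_size * block_count),
          next_(block_count) {
        // Chain every block into the free list: 0 -> 1 -> ... -> kNil.
        for (std::uint32_t i = 0; i < block_count; ++i)
            next_[i].store(i + 1 < block_count ? i + 1 : kNil,
                           std::memory_order_relaxed);
        head_.store(pack(block_count ? 0u : kNil, 0u), std::memory_order_release);
    }

    // Returns nullptr when the pool is exhausted; the caller must handle that.
    void* allocate() {
        std::uint64_t old_head = head_.load(std::memory_order_acquire);
        for (;;) {
            const std::uint32_t idx = index_of(old_head);
            if (idx == kNil) return nullptr;
            const std::uint32_t nxt = next_[idx].load(std::memory_order_relaxed);
            const std::uint64_t new_head = pack(nxt, tag_of(old_head) + 1);
            if (head_.compare_exchange_weak(old_head, new_head,
                                            std::memory_order_acq_rel,
                                            std::memory_order_acquire))
                return storage_.data() + std::size_t(idx) * block_size_;
            // The exchange failed: old_head now holds the current head, retry.
        }
    }

    void release(void* p) {
        const std::uint32_t idx = static_cast<std::uint32_t>(
            (static_cast<unsigned char*>(p) - storage_.data()) / block_size_);
        std::uint64_t old_head = head_.load(std::memory_order_relaxed);
        for (;;) {
            next_[idx].store(index_of(old_head), std::memory_order_relaxed);
            const std::uint64_t new_head = pack(idx, tag_of(old_head) + 1);
            if (head_.compare_exchange_weak(old_head, new_head,
                                            std::memory_order_release,
                                            std::memory_order_relaxed))
                return;
        }
    }

private:
    static constexpr std::uint32_t kNil = 0xFFFFFFFFu;  // "no block" marker
    // The head packs a change counter (high 32 bits) with a block index (low 32 bits).
    static std::uint64_t pack(std::uint32_t idx, std::uint32_t tag) {
        return (std::uint64_t(tag) << 32) | idx;
    }
    static std::uint32_t index_of(std::uint64_t h) { return std::uint32_t(h); }
    static std::uint32_t tag_of(std::uint64_t h) { return std::uint32_t(h >> 32); }

    std::size_t block_size_;
    std::vector<unsigned char> storage_;           // the pre-allocated memory
    std::vector<std::atomic<std::uint32_t>> next_; // next_[i] = block after i
    std::atomic<std::uint64_t> head_{0};
};

// Several threads hammer the pool; afterwards every block must be
// available again exactly once. Returns the number of free blocks found.
std::uint32_t pool_demo() {
    LockFreePool pool(32, 64);
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([&pool] {
            for (int i = 0; i < 10000; ++i)
                if (void* p = pool.allocate()) pool.release(p);
        });
    for (auto& w : workers) w.join();
    std::uint32_t available = 0;
    while (pool.allocate() != nullptr) ++available;
    return available;
}
```

Note the retry loops: a thread can lose the exchange and go around again, which is exactly the "lock-free but not zero wait time" caveat above.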
Another solution that can work is to have "per thread pools". This can get a bit complicated if you pass data between threads, because you then have to accept that "memory freed for reuse may end up in a different thread" (which of course leads to problems along the lines of "all the memory now sits in that one thread that collects and frees the information from many other threads, and all other threads have run out of memory").

Related

Is access to the heap serialized?

One rule every programmer quickly learns about multithreading is:
If more than one thread has access to a data structure, and at least one of the threads might modify that data structure, then you'd better serialize all accesses to that data structure, or you're in for a world of debugging pain.
Typically this serialization is done via a mutex -- i.e. a thread that wants to read or write the data structure locks the mutex, does whatever it needs to do, and then unlocks the mutex to make it available again to other threads.
Which brings me to the point: the memory-heap of a process is a data structure which is accessible by multiple threads. Does this mean that every call to default/non-overloaded new and delete is serialized by a process-global mutex, and is therefore a potential serialization-bottleneck that can slow down multithreaded programs? Or do modern heap implementations avoid or mitigate that problem somehow, and if so, how do they do it?
(Note: I'm tagging this question linux, to avoid the correct-but-uninformative "it's implementation-dependent" response, but I'd also be interested in hearing about how Windows and MacOS/X do it as well, if there are significant differences across implementations)
new and delete are thread safe
The following functions are required to be thread-safe:
The library versions of operator new and operator delete
User replacement versions of global operator new and operator delete
std::calloc, std::malloc, std::realloc, std::aligned_alloc, std::free
Calls to these functions that allocate or deallocate a particular unit of storage occur in a single total order, and each such deallocation call happens-before the next allocation (if any) in this order.
With gcc, new is implemented by delegating to malloc, and we see that their malloc does indeed use a lock. If you are worried about your allocation causing bottlenecks, write your own allocator.
Answer is yes, but in practice it is usually not a problem.
If it is a problem for you, you may try replacing your implementation of malloc with tcmalloc, which reduces, but does not eliminate, possible contention (since there is only one heap that needs to be shared among threads and processes).
TCMalloc assigns each thread a thread-local cache. Small allocations are satisfied from the thread-local cache. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections are used to migrate memory back from a thread-local cache into the central data structures.
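The thread-cache idea in that description can be sketched roughly as follows. This is not TCMalloc's code; CentralList, ThreadCache, kBlockSize and kBatch are hypothetical names, there is a single size class, and the cache is a plain object owned by each worker (a real allocator would keep it in thread_local storage). The point is only that the lock is taken once per batch rather than once per allocation.

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <thread>
#include <vector>

constexpr std::size_t kBlockSize = 64;  // one size class, for brevity
constexpr std::size_t kBatch = 32;      // blocks moved per visit to the central list

// Central free list shared by all threads; every access takes the lock.
class CentralList {
public:
    ~CentralList() { for (void* p : blocks_) std::free(p); }

    // Moves n blocks into out, creating new ones with malloc when empty.
    void fetch(std::vector<void*>& out, std::size_t n) {
        std::lock_guard<std::mutex> lock(mutex_);
        ++lock_acquisitions_;
        for (std::size_t i = 0; i < n; ++i) {
            if (blocks_.empty()) {
                out.push_back(std::malloc(kBlockSize));  // sketch: no null check
            } else {
                out.push_back(blocks_.back());
                blocks_.pop_back();
            }
        }
    }

    // Moves up to n blocks from in back to the central list.
    void give_back(std::vector<void*>& in, std::size_t n) {
        std::lock_guard<std::mutex> lock(mutex_);
        ++lock_acquisitions_;
        for (std::size_t i = 0; i < n && !in.empty(); ++i) {
            blocks_.push_back(in.back());
            in.pop_back();
        }
    }

    std::size_t lock_acquisitions() {
        std::lock_guard<std::mutex> lock(mutex_);
        return lock_acquisitions_;
    }

private:
    std::mutex mutex_;
    std::vector<void*> blocks_;
    std::size_t lock_acquisitions_ = 0;  // counts fetch/give_back calls only
};

// Per-thread cache: the common path touches no shared state at all.
class ThreadCache {
public:
    explicit ThreadCache(CentralList& central) : central_(central) {}
    ~ThreadCache() { central_.give_back(local_, local_.size()); }

    void* allocate() {
        if (local_.empty()) central_.fetch(local_, kBatch);  // slow path, locks
        void* p = local_.back();                             // fast path, no lock
        local_.pop_back();
        return p;
    }

    void deallocate(void* p) {
        local_.push_back(p);
        // Crude stand-in for the "periodic garbage collection" step.
        if (local_.size() > 2 * kBatch) central_.give_back(local_, kBatch);
    }

private:
    CentralList& central_;
    std::vector<void*> local_;
};

// Four threads perform 1000 allocate/deallocate pairs each and the function
// reports how many times the central lock was actually taken.
std::size_t cache_demo() {
    CentralList central;
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([&central] {
            ThreadCache cache(central);
            for (int i = 0; i < 1000; ++i) {
                void* p = cache.allocate();
                cache.deallocate(p);
            }
        });
    for (auto& w : workers) w.join();
    return central.lock_acquisitions();
}
```

In this toy workload each thread visits the central list twice (one refill, one final hand-back), so 4000 allocations cost 8 lock acquisitions.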
There are also other options like using custom allocators and/or specialized containers and/or redesigning your application.
You tried to avoid the "it's architecture/system dependent" answer, but the problem that multiple threads must serialize their accesses to the system only arises with heaps that grow or shrink, at the moment the program needs to expand the heap or return part of it to the system.
The first answer has to be simply "it's implementation dependent", without any system dependencies, because libraries normally get large chunks of memory on which to base the heap and administer those internally, which makes the problem actually operating system and architecture independent.
The second answer is that, of course, if you have only one single heap for all the threads, you have a possible bottleneck when all of the active threads compete for a single chunk of memory. There are several approaches to this. You can have a pool of heaps to allow parallelism, and make different threads use different pools for their requests. Keep in mind that the largest problem is likely to be in requesting memory, as this is where the bottleneck occurs. On returning memory there is no such issue, as you can act more like a garbage collector: queue the returned chunks of memory and let a dedicated thread put them back in the proper places, preserving the integrity of the heaps. Having multiple heaps even allows you to classify them by priority, by chunk size, etc., so the risk of collision is kept low by the class of problem you are dealing with. This is the case in operating system kernels like *BSD, which use several memory heaps, each somewhat dedicated to the kind of use it is going to receive (there's one for the disk I/O buffers, one for virtual memory mapped segments, one for process virtual memory space management, etc.).
I recommend reading The Design and Implementation of the FreeBSD Operating System, which explains very well the approach used in the kernel of BSD systems. It is general enough that probably a large percentage of other systems follow this or a very similar approach.
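As a rough illustration of the "pool of heaps" approach from this answer (it is not taken from the BSD kernel or the book), the sketch below keeps four independent arenas, each with its own lock, and picks one by hashing the calling thread's id. Arena, ArenaSet and arena_demo are invented names, the arenas only ever hand memory out (no free), and the arena count and size are arbitrary assumptions.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// One independent heap with its own lock. Memory is handed out by bumping an
// offset; freeing is left out to keep the sketch short.
class Arena {
public:
    Arena() : buffer_(1 << 20) {}  // 1 MiB per arena, an arbitrary choice

    void* allocate(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        n = (n + a - 1) / a * a;  // round up to keep later blocks aligned
        std::lock_guard<std::mutex> lock(mutex_);
        if (used_ + n > buffer_.size()) return nullptr;
        void* p = buffer_.data() + used_;
        used_ += n;
        return p;
    }

    std::size_t used() {
        std::lock_guard<std::mutex> lock(mutex_);
        return used_;
    }

private:
    std::mutex mutex_;
    std::vector<unsigned char> buffer_;
    std::size_t used_ = 0;
};

// Threads only contend when they happen to hash to the same arena.
class ArenaSet {
public:
    void* allocate(std::size_t n) {
        const std::size_t h =
            std::hash<std::thread::id>{}(std::this_thread::get_id());
        return arenas_[h % arenas_.size()].allocate(n);
    }

    std::size_t total_used() {
        std::size_t sum = 0;
        for (Arena& a : arenas_) sum += a.used();
        return sum;
    }

private:
    std::array<Arena, 4> arenas_;
};

// Four threads allocate 100 blocks of 64 bytes each; returns the bytes handed
// out across all arenas.
std::size_t arena_demo() {
    ArenaSet heaps;
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([&heaps] {
            for (int i = 0; i < 100; ++i) heaps.allocate(64);
        });
    for (auto& w : workers) w.join();
    return heaps.total_used();
}
```

Which arena a thread lands in depends on the hash, so two threads may still share one; the scheme lowers the chance of collision rather than removing the locks.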

Shouldn't malloc be asynchronous?

Am I correct to assume that when a process calls malloc, there may be I/O involved (swapping caches out etc) to make memory available, which in turn implies it can block considerable time? Thus, shouldn't we have two versions of malloc in linux, one say "fast_malloc" which is suitable for obtaining smaller chunks & guaranteed not to block (but may of course still fail with OUT_OF_MEMORY) and another async_malloc where we could ask for arbitrary-size space but require a callback?
Example: if I need a smaller chunk of memory to make room for an item in the linked-list, I may prefer the traditional inline malloc knowing the OS should be able to satisfy it 99.999% of the time or just fail. Another example: if I'm a DB server trying to allocate a sizable chunk to put indexes in it I may opt for the async_malloc and deal with the "callback complexity".
The reason I brought this up is that I'm looking to create highly concurrent servers handling hundreds of thousands of web requests per second and generally avoid threads for handling the requests. Put another way, anytime I/O occurs I want it to be asynchronous (say libevent based). Unfortunately I'm realizing most C APIs lack proper support for concurrent use. For example, the ubiquitous MySQL C library is entirely blocking, and that's just one library my servers use extensively. Again, I can always simulate non-blocking by offloading to another thread but that's nowhere near as cheap as waiting for result via completion callback.
As kaylum said in a comment:
Calling malloc will not inherently cause more IO. Perhaps you are confusing use of the memory returned versus just allocating the memory to you. Just because you ask for 100MB does not mean that malloc will immediately trigger 100MB of swapping. That only happens when you access the memory.
If you want to protect against long delays for swapping, etc. during subsequent access to the allocated memory, you can call mlock on it in a separate thread (so your process isn't stalled waiting for mlock to complete). Once mlock has succeeded, the memory is physically instantiated and cannot be swapped out until munlock.
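A minimal sketch of that suggestion, assuming a POSIX system: the buffer is mapped with mmap and mlock runs on a helper thread via std::async, so the calling thread is free to do other work in the meantime. The helper names (lock_in_background, mlock_demo) are made up, and the 16 KiB size is only chosen to stay under a typical RLIMIT_MEMLOCK; mlock can still fail depending on limits and privileges, so the status has to be checked.

```cpp
#include <sys/mman.h>  // mmap, munmap, mlock, munlock (POSIX)
#include <cerrno>
#include <cstddef>
#include <future>

// Starts mlock on a helper thread. The future yields 0 on success, or the
// errno value reported by mlock on failure.
std::future<int> lock_in_background(void* p, std::size_t bytes) {
    return std::async(std::launch::async, [p, bytes] {
        return mlock(p, bytes) == 0 ? 0 : errno;
    });
}

// Maps a small buffer, locks it in the background and reports the status.
int mlock_demo() {
    const std::size_t bytes = 16 * 1024;
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return errno;

    std::future<int> pending = lock_in_background(p, bytes);
    // An event loop would keep serving requests here and poll the future
    // (for example with wait_for) instead of blocking in get().
    const int status = pending.get();

    if (status == 0) {
        static_cast<char*>(p)[0] = 1;  // resident now: no page fault expected
        munlock(p, bytes);
    }
    munmap(p, bytes);
    return status;
}
```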
Remember that a call to malloc() does not necessarily result in your program asking the OS for more memory. It's down to the C runtime's implementation of malloc().
For glibc malloc() merely (depending on how much you're asking for) returns a pointer to memory that the runtime has already got from the OS. Similarly free() doesn't necessarily return memory to the OS. It's a lot faster that way. I think glibc's malloc() is thread safe too.
Interestingly this gives C, C++ (and everything built on top) the same sort of properties normally associated with languages like Java and C#. Arguably building a runtime like Java or C# on top of a runtime like glibc means that there's actually more work than necessary going on to manage memory... Unless they're not using malloc() or new at all.
There are various allocators out there, and you can link whichever one you want into your program regardless of what your normal C runtime provides. So even on platforms like *BSD (which are typically far more traditional in their memory allocation approach, asking the OS each and every time you call malloc() or new) you can pull off the same trick.
Put another way, anytime I/O occurs I want it to be asynchronous (say libevent based).
I have bad news for you. Any time you access memory you risk blocking for I/O.
malloc itself is quite unlikely to block because the system calls it uses just create an entry in a data structure that tells the kernel "map in some memory here when it's accessed". This means that malloc will only block when it needs to go down to the kernel to map more memory and either the kernel is out of memory so that it itself has to wait for allocating its internal data structure (you can wait for quite a while then) or you use mlockall. The actual allocating of memory that can cause swapping doesn't happen until you touch memory. And your own memory can be swapped out at any time (or your program text could be paged out) and you have pretty much no control over it.

TBB tbb::memory_pool<tbb::scalable_allocator<char>> How to use it correctly?

I have a doubt.
For tbb::memory_pool<tbb::scalable_allocator<char>> shared_memory_pool_;
if that is instantiated in the main thread, and then I call shared_memory_pool_.malloc(sizeof(my_class)) in a worker thread, will tbb allocate that size of memory from the main heap, or would it allocate it from the thread "domain" so that the lock contention caused by the normal malloc() would still be avoided?
The tbb::memory_pool is based on the same internals as tbb::scalable_allocator. So, once the memory pool grabs the memory initially (in your case as you specified, from tbb::scalable_allocator as well), it will use the same mechanisms to distribute and reuse it across the threads. I.e. it is scalable and avoids the global lock as much as possible. Though, as the memory is still a common resource, some thread synchronization is unavoidable anyway. Specifically, I'd expect more contention for initial memory requests since the per-thread caches are not warm yet; and also the scalable_allocator tries to keep the balance between scalability and memory consumption thus it'll not go crazy with per-thread caches redistributing the memory among threads which is also kind of thread synchronization (though more scalable than a lock).
As for the [very] initial memory allocation by scalable_allocator, it goes through mmap or VirtualAlloc for big enough memory chunks and not through malloc.
Here are some useful descriptions of how to implement a memory pool correctly. Please note that according to that:
In our implementation, we tried to provide more general functionality
in thread-safe and scalable way. For that purpose, the implementation
of the memory pools is based on TBB scalable memory allocator and so
has similar speed and memory consumption properties.
Hope this helps.

Heap optimized for (but not limited to) single-threaded usage

I use a custom heap implementation in one of my projects. It consists of two major parts:
Fixed size-block heap. I.e. a heap that allocates blocks of a specific size only. It allocates larger memory blocks (either virtual memory pages or from another heap), and then divides them into atomic allocation units.
It performs allocation/freeing fast (in O(1)) and there's no memory usage overhead, not taking into account things imposed by the external heap.
Global general-purpose heap. It consists of buckets of the above (fixed-size) heaps. WRT the requested allocation size it chooses the appropriate bucket, and performs the allocation via it.
Since the whole application is (heavily) multi-threaded - the global heap locks the appropriate bucket during its operation.
Note: in contrast to the traditional heaps, this heap requires the allocation size not only for the allocation, but also for freeing. This allows to identify the appropriate bucket without searches or extra memory overhead (such as saving the block size preceding the allocated block). Though somewhat less convenient, this is ok in my case. Moreover, since the "bucket configuration" is known at compile-time (implemented via C++ template voodoo) - the appropriate bucket is determined at compile time.
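For readers who want to picture the design described above, here is a rough sketch. It is my guess at the shape, not the poster's actual code. FixedHeap, bucket_size, heap_for, sized_alloc and sized_free are invented names, and the four bucket sizes and the 64 KiB chunk size are arbitrary. Each bucket has its own lock, free blocks are chained through their own storage (so there is no per-block header), and the bucket is chosen at compile time from sizeof(T), which is why the size is needed again on free.

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <new>
#include <vector>

// Heap that hands out blocks of exactly BlockSize bytes.
template <std::size_t BlockSize>
class FixedHeap {
    static_assert(BlockSize >= sizeof(void*) && BlockSize % alignof(void*) == 0,
                  "a free block must be able to hold the list link");
    struct Node { Node* next; };

public:
    ~FixedHeap() { for (void* c : chunks_) std::free(c); }

    void* allocate() {
        std::lock_guard<std::mutex> lock(mutex_);  // per-bucket lock
        if (!free_) refill();
        Node* n = free_;
        free_ = n->next;
        return n;
    }

    void deallocate(void* p) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_ = new (p) Node{free_};  // push the block onto the free list
    }

private:
    // Gets one larger chunk from the external heap and carves it into blocks.
    void refill() {
        constexpr std::size_t kChunk = 64 * 1024;
        char* chunk = static_cast<char*>(std::malloc(kChunk));
        if (!chunk) throw std::bad_alloc();
        chunks_.push_back(chunk);
        for (std::size_t off = 0; off + BlockSize <= kChunk; off += BlockSize)
            free_ = new (chunk + off) Node{free_};
    }

    std::mutex mutex_;
    Node* free_ = nullptr;
    std::vector<void*> chunks_;
};

// Rounds a request up to its bucket size; usable in constant expressions.
constexpr std::size_t bucket_size(std::size_t n) {
    return n <= 16 ? 16 : n <= 32 ? 32 : n <= 64 ? 64 : 128;
}

// One heap per bucket size, created on first use.
template <std::size_t N>
FixedHeap<N>& heap_for() {
    static FixedHeap<N> heap;
    return heap;
}

// The bucket is selected by the compiler from sizeof(T).
template <typename T>
void* sized_alloc() {
    static_assert(sizeof(T) <= 128, "no bucket large enough");
    return heap_for<bucket_size(sizeof(T))>().allocate();
}

template <typename T>
void sized_free(void* p) {
    heap_for<bucket_size(sizeof(T))>().deallocate(p);
}

struct Sample { char bytes[24]; };  // lands in the 32-byte bucket

// Returns true if a freed block is the next one handed out again.
bool fixed_heap_demo() {
    void* a = sized_alloc<Sample>();
    void* b = sized_alloc<Sample>();
    sized_free<Sample>(a);
    void* c = sized_alloc<Sample>();
    const bool ok = (a != b) && (c == a);
    sized_free<Sample>(b);
    sized_free<Sample>(c);
    return ok;
}
```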
So far everything looks (and works) good.
Recently I worked on an algorithm that performs heap operations heavily, and is naturally affected significantly by the heap performance. Profiling revealed that its performance is considerably impacted by the locking. That is, the heap itself works very fast (typical allocation involves just a few memory dereferencing instructions), but since the whole application is multi-threaded - the appropriate bucket is protected by the critical section, which relies on interlocked instructions, which are much heavier.
I've fixed this meanwhile by giving this algorithm its own dedicated heap, which is not protected by a critical section. But this imposes several problems/restrictions at the code level. Such as the need to pass the context information deep within the stack wherever the heap may be necessary. One may also use TLS to avoid this, but this may cause some problems with re-entrance in my specific case.
This makes me wonder: Is there a known technique to optimize the heap for (but not limit to) single-threaded usage?
EDIT:
Special thanks to @Voo for suggesting checking out Google's tcmalloc.
It seems to work similar to what I did more-or-less (at least for small objects). But in addition they solve the exact issue I have, by maintaining per-thread caching.
I too thought in this direction, but I thought about maintaining per-thread heaps. Then freeing a memory block allocated from the heap belonging to another thread is somewhat tricky: one should insert it in a sort of a locked queue, and that other thread should be notified, and free the pending allocations asynchronously. Asynchronous deallocation may cause problems: if that thread is busy for some reason (for instance it performs an aggressive calculation) - no memory deallocation actually occurs. Plus, in a multi-threaded scenario the cost of deallocation is significantly higher.
OTOH the idea with caching seems much simpler, and more efficient. I'll try to work it out.
Thanks a lot.
P.S.:
Indeed google's tcmalloc is great. I believe it's implemented pretty much similar to what I did (at least fixed-size part).
But, to be pedantic, there's one matter where my heap is superior. According to docs, tcmalloc imposes an overhead roughly 1% (asymptotically), whereas my overhead is 0.0061%. It's 4/64K to be exact.
:)
One thought is to maintain a memory allocator per-thread. Pre-assign fairly chunky blocks of memory to each allocator from a global memory pool. Design your algorithm to assign the chunky blocks from adjacent memory addresses (more on that later).
When the allocator for a given thread is low on memory, it requests more memory from the global memory pool. This operation requires a lock, but should occur far less frequently than in your current case. When the allocator for a given thread frees its last byte, return all memory for that allocator to the global memory pool (assume thread is terminated).
This approach will tend to exhaust memory earlier than your current approach (memory can be reserved for one thread that never needs it). The extent to which that is an issue depends on the thread creation / lifetime / destruction profile of your app(s). You can mitigate that at the expense of additional complexity, e.g. by introducing a signal that a memory allocator for a given thread is out of memory and the global pool is exhausted, which other memory allocators can respond to by freeing some memory.
An advantage of this scheme is that it will tend to eliminate false sharing, as memory for a given thread will tend to be allocated in contiguous address spaces.
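A minimal sketch of this scheme, under several assumptions: the global pool is reserved up front and split into equal chunks, each thread's allocator bumps an offset inside its current chunk, and all chunks go back to the pool once the thread's last live block is freed. GlobalPool, ThreadAllocator and per_thread_demo are invented names, and the chunk sizes are arbitrary. The lock counter is only there to show how rarely the global lock is taken.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Global pool: one contiguous reservation split into equal chunks.
class GlobalPool {
public:
    GlobalPool(std::size_t chunk_bytes, std::size_t chunk_count)
        : chunk_bytes_(chunk_bytes), storage_(chunk_bytes * chunk_count) {
        for (std::size_t i = chunk_count; i-- > 0;)
            free_.push_back(storage_.data() + i * chunk_bytes);
    }
    unsigned char* acquire() {  // locked, but called rarely
        std::lock_guard<std::mutex> lock(mutex_);
        ++lock_count_;
        if (free_.empty()) return nullptr;
        unsigned char* c = free_.back();
        free_.pop_back();
        return c;
    }
    void release(unsigned char* c) {
        std::lock_guard<std::mutex> lock(mutex_);
        ++lock_count_;
        free_.push_back(c);
    }
    std::size_t chunk_bytes() const { return chunk_bytes_; }
    std::size_t lock_count() {
        std::lock_guard<std::mutex> lock(mutex_);
        return lock_count_;
    }
    std::size_t free_chunks() {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
    }

private:
    std::size_t chunk_bytes_;
    std::vector<unsigned char> storage_;
    std::vector<unsigned char*> free_;
    std::mutex mutex_;
    std::size_t lock_count_ = 0;  // counts acquire/release calls only
};

// Per-thread allocator: no locking unless it needs a new chunk.
class ThreadAllocator {
public:
    explicit ThreadAllocator(GlobalPool& pool) : pool_(pool) {}
    ~ThreadAllocator() { return_chunks(); }

    void* allocate(std::size_t n) {
        n = (n + 15) & ~std::size_t(15);  // keep blocks 16-byte aligned
        if (n > pool_.chunk_bytes()) return nullptr;
        if (chunks_.empty() || used_ + n > pool_.chunk_bytes()) {
            unsigned char* c = pool_.acquire();
            if (!c) return nullptr;  // global pool exhausted
            chunks_.push_back(c);
            used_ = 0;
        }
        void* p = chunks_.back() + used_;
        used_ += n;
        ++live_;
        return p;
    }

    // Individual blocks are not reused; when the last live block is freed,
    // every chunk goes back to the global pool.
    void deallocate(void*) {
        if (--live_ == 0) return_chunks();
    }

private:
    void return_chunks() {
        for (unsigned char* c : chunks_) pool_.release(c);
        chunks_.clear();
        used_ = 0;
    }
    GlobalPool& pool_;
    std::vector<unsigned char*> chunks_;
    std::size_t used_ = 0;
    std::size_t live_ = 0;
};

struct PoolStats { std::size_t lock_count; std::size_t free_chunks; };

// Four threads allocate and free 100 blocks of 64 bytes each.
PoolStats per_thread_demo() {
    GlobalPool pool(4096, 16);
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([&pool] {
            ThreadAllocator alloc(pool);
            std::vector<void*> blocks;
            for (int i = 0; i < 100; ++i) blocks.push_back(alloc.allocate(64));
            for (void* p : blocks) alloc.deallocate(p);
        });
    for (auto& w : workers) w.join();
    return PoolStats{pool.lock_count(), pool.free_chunks()};
}
```

Here each thread needs two 4 KiB chunks, so 800 allocate/free calls cost 16 trips through the global lock, and each thread's blocks sit next to each other inside its own chunks.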
On a side note, if you have not already read it, I suggest IBM's Inside Memory Management article for anyone implementing their own memory management.
UPDATE
If the goal is to have very fast memory allocation optimized for a multi-threaded environment (as opposed to learning how to do it yourself), have a look at alternate memory allocators. If the goal is learning, perhaps check out their source code.
Hoard
tcmalloc (thanks Voo)
It might be a good idea to read Jeff Bonwick's classic papers on the slab allocator and vmem. The original slab allocator sounds somewhat like what you're doing. Although not very multithread friendly, it might give you some ideas.
The Slab Allocator: An Object-Caching Kernel Memory Allocator
Then he extended the concept with VMEM, which will definitely give you some ideas since it had very nice behavior in a multi cpu environment.
Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources

Can i allocate memory faster by using multiple threads?

If I make a loop that reserves 1kb integer arrays, int[1024], and I want it to allocate 10000 arrays, can I make it faster by running the memory allocations from multiple threads?
I want them to be in the heap.
Let's assume that I have a multi-core processor for the job.
I already did try this, but it decreased the performance. I'm just wondering, did I just make bad code or is there something that I didn't know about memory allocation?
Does the answer depend on the OS? please tell me how it works on different platforms if so.
Edit:
The integer array allocation loop was just a simplified example. Don't bother telling me how I can improve that.
It depends on many things, but primarily:
the OS
the implementation of malloc you are using
The OS is responsible for allocating the "virtual memory" that your process has access to and builds a translation table that maps the virtual memory back to actual memory addresses.
Now, the default implementation of malloc is generally conservative, and will simply have a giant lock around all this. This means that requests are processed serially, and the only thing that allocating from multiple threads instead of one does is slowing down the whole thing.
There are more clever allocation schemes, generally based upon pools, and they can be found in other malloc implementations: tcmalloc (from Google) and jemalloc (used by Facebook) are two such implementations designed for high-performance in multi-threaded applications.
There is no silver bullet though, and at one point the OS must perform the virtual <=> real translation which requires some form of locking.
Your best bet is to allocate by arenas:
Allocate big chunks (arenas) at once
Split them up in arrays of the appropriate size
There is no need to parallelize the arena allocation, and you'll be better off asking for the biggest arenas you can (do bear in mind that allocation requests for a too large amount may fail), then you can parallelize the split.
tcmalloc and jemalloc may help a bit, however they are not designed for big allocations (which is unusual) and I do not know if it is possible to configure the size of the arenas they request.
The answer depends on the memory allocation routines, which are a combination of a C++ library layer operator new, probably wrapped around libc malloc(), which in turn occasionally calls an OS function such as sbrk(). The implementation and performance characteristics of all of these are unspecified, and may vary from compiler version to version, with compiler flags, different OS versions, different OSes, etc. If profiling shows it's slower, then that's the bottom line. You can try varying the number of threads, but what's probably happening is that the threads are all trying to obtain the same lock in order to modify the heap... the overheads involved with saying "ok, thread X gets the go ahead next" and "thread X here, I'm done" are simply wasting time. Another C++ environment might end up using atomic operations to avoid locking, which might or might not prove faster... no general rule.
If you want to complete faster, consider allocating one array of 10000*1024 ints, then using different parts of it (e.g. [0]..[1023], [1024]..[2047]...).
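Since the only reliable answer here is to measure, a small harness along these lines lets you vary the thread count on your own platform. It is a sketch, not a rigorous benchmark: time_allocations and AllocTiming are invented names, thread start-up cost is included in the timing, and a single run says little, so repeat it a few times before drawing conclusions.

```cpp
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

struct AllocTiming {
    long long micros;       // elapsed wall-clock time
    std::size_t allocated;  // number of arrays actually allocated
};

// Allocates `total` arrays of 1024 ints, split evenly over `threads` worker
// threads, and reports how long the allocation phase took.
AllocTiming time_allocations(int total, int threads) {
    std::vector<std::vector<int*>> results(threads);
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([&results, t, total, threads] {
            const int count = total / threads;
            results[t].reserve(count);
            for (int i = 0; i < count; ++i)
                results[t].push_back(new int[1024]);
        });
    for (auto& w : workers) w.join();

    const auto stop = std::chrono::steady_clock::now();

    // Clean up outside the timed region.
    std::size_t allocated = 0;
    for (auto& r : results) {
        allocated += r.size();
        for (int* p : r) delete[] p;
    }
    return {std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
                .count(),
            allocated};
}
```

Comparing time_allocations(10000, 1) against time_allocations(10000, 4) with the default allocator, and again with another allocator linked in, shows directly whether the lock is the bottleneck in your setup.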
I think that perhaps you need to adjust your expectation from multi-threading.
The main advantage of multi-threading is that you can do tasks asynchronously, i.e. in parallel. In your case, when your main thread needs more memory it does not matter whether it is allocated by another thread - you still need to stop and wait for allocation to be accomplished, so there is no parallelism here. In addition, there is an overhead of a thread signaling when it is done and the other waiting for completion, which just can degrade the performance. Also, if you start a thread each time you need allocation this is a huge overhead. If not, you need some mechanism to pass the allocation request and response between threads, a kind of task queue which again is an overhead without gain.
Another approach could be that the allocating thread runs ahead and pre-allocates the memory that you will need. This can give you a real gain, but if you are doing pre-allocation, you might as well do it in the main thread which will be simpler. E.g. allocate 10M in one shot (or 10 times 1M, or as much contiguous memory as you can have) and have an array of 10,000 pointers pointing to it at 1024 offsets, representing your arrays. If you don't need to deallocate them independently of one another this seems to be much simpler and could be even more efficient than using multi-threading.
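The single-allocation layout described here might look like the sketch below. ArraySet and array_set_demo are illustrative names, and std::vector is used for the backing store, which also zero-fills the memory (a plain new int[n] would not).

```cpp
#include <cstddef>
#include <vector>

// One contiguous allocation plus a table of pointers into it.
struct ArraySet {
    std::vector<int> storage;  // count * length ints, allocated in one shot
    std::vector<int*> arrays;  // arrays[i] points at the i-th slice

    ArraySet(std::size_t count, std::size_t length)
        : storage(count * length), arrays(count) {
        for (std::size_t i = 0; i < count; ++i)
            arrays[i] = storage.data() + i * length;
    }

    // Copying would leave the copied pointers aimed at the original storage.
    ArraySet(const ArraySet&) = delete;
    ArraySet& operator=(const ArraySet&) = delete;
};

// Writes through one of the small arrays and reads the value back from the
// backing store.
int array_set_demo() {
    ArraySet set(10000, 1024);
    set.arrays[123][5] = 42;
    return set.storage[123 * 1024 + 5];
}
```

The small arrays cannot be freed individually; everything is released together when the ArraySet goes out of scope.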
As for glibc, it has arenas (see here), with a lock per arena.
You may also consider tcmalloc by Google (stands for Thread-Caching malloc), which shows a 30% performance boost for threaded applications. We use it in our project. In debug mode it can even discover some incorrect usage of memory (e.g. new/free mismatch).
As far as I know, all OSes have an implicit mutex lock inside the dynamic allocation calls (malloc...). If you think about it for a moment, without locking this action you could run into terrible problems.
You could use the multithreading API Threading Building Blocks http://threadingbuildingblocks.org/
which has a multithreading-friendly scalable allocator.
But I think a better idea would be to allocate the whole memory once (which should work quite fast) and split it up on your own. I think the tbb allocator does something similar.
Do something like
new int[1024*10000] and then assign the parts of 1024 ints to your pointer array or whatever you use.
Do you understand?
Because the heap is shared per-process the heap will be locked for each allocation, so it can only be accessed serially by each thread. This could explain the decrease of performance when you do alloc from multiple threads like you are doing.
If the arrays belong together and will only be freed as a whole, you can just allocate an array of 10000*1024 ints, and then make your individual arrays point into it. Just remember that you cannot delete the small arrays, only the whole.
int *all_arrays = new int[1024 * 10000];
int *small_array123 = all_arrays + 1024 * 123;
Like this, you have small arrays when you replace the 123 with a number between 0 and 9999.
The answer depends on the operating system and runtime used, but in most cases, you cannot.
Generally, you will have two versions of the runtime: a multi-threaded version and a single-threaded version.
The single-threaded version is not thread-safe. Allocations made by two threads at the same time can blow your application up.
The multi-threaded version is thread-safe. However, as far as allocations go on most common implementations, this just means that calls to malloc are wrapped in a mutex. Only one thread can ever be in the malloc function at any given time, so attempting to speed up allocations with multiple threads will just result in a lock convoy.
It may be possible that there are operating systems that can safely handle parallel allocations within the same process, using minimal locking, which would allow you to decrease time spent allocating. Unfortunately, I don't know of any.