Will allocating memory block all threads? [duplicate] - c++

I'm curious as to whether there is a lock on memory allocation if two threads simultaneously request to allocate memory. I am using OpenMP for multithreading, in C++ code.
OSes: mostly Linux, but I would like to know about Windows and Mac as well.

There could be improvements in certain implementations, such as a thread-specific cache (in which case allocations of small blocks will be lock-free). For instance, tcmalloc from Google does this. But in general, yes, there is a lock on memory allocations.

By default Windows locks the heap when you use the Win API heap functions.
You can control the locking at least at the time of heap creation. Different compilers and C runtimes do different things with the malloc/free family. For example, the SmartHeap API at one point created one heap per thread and therefore needed no locking. There were also config options to turn that behavior on and off.
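For illustration, here is a minimal hedged sketch (Win32 only) of creating a private heap with serialization disabled via HEAP_NO_SERIALIZE; this is only safe if a single thread uses the heap, or if you provide your own locking:

    #include <windows.h>

    int main() {
        // Private heap with no internal locking: safe only with
        // single-threaded access or external synchronization.
        HANDLE heap = HeapCreate(HEAP_NO_SERIALIZE, 0 /*initial size*/, 0 /*grow on demand*/);
        if (!heap) return 1;

        void* p = HeapAlloc(heap, 0, 256);   // no lock is taken on this heap
        HeapFree(heap, 0, p);
        HeapDestroy(heap);                   // releases the whole heap at once
    }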
At one point in the early/mid '90s, the Borland Windows and OS/2 compilers explicitly turned off heap locking (a premature-optimization bug) until multiple threads were launched with beginthread. Many, many people tried to spawn threads with an OS API call and then were surprised when the heap corrupted itself all to hell...

http://en.wikipedia.org/wiki/Malloc
Modern malloc implementations try to be as lock-free as possible by keeping separate "arenas" for each thread.
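On Linux with glibc you can cap how many of these arenas malloc creates; a small sketch (glibc-specific, and the default arena limit, reportedly 8 x cores on 64-bit systems, may vary by version):

    #include <malloc.h>  // glibc-specific: mallopt(), M_ARENA_MAX

    int main() {
        // Limit malloc to 4 arenas: lowers memory overhead, but more
        // threads will share an arena and may contend for its lock.
        mallopt(M_ARENA_MAX, 4);
        // The same knob exists as the MALLOC_ARENA_MAX environment variable.
    }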

Free store is a shared resource and must be synchronized. Allocation/deallocation is costly. If you are multithreading for performance, then frequent allocation/deallocation can become a bottleneck. As a general rule, avoid allocation/deallocation inside tight loops. Another problem is false sharing.
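Since the question uses OpenMP, here is a minimal sketch of that rule: hoist the allocation out of the hot loop and give each thread its own reusable buffer (the function and sizes are hypothetical):

    #include <vector>

    void process(int n) {
        #pragma omp parallel
        {
            // Declared inside the parallel region, so each thread gets
            // its own buffer, allocated once rather than per iteration.
            std::vector<double> scratch(1024);

            #pragma omp for
            for (int i = 0; i < n; ++i) {
                scratch.assign(scratch.size(), 0.0);  // reuse capacity, no new/delete
                // ... do work on scratch ...
            }
        }
    }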

Related

Is access to the heap serialized?

One rule every programmer quickly learns about multithreading is:
If more than one thread has access to a data structure, and at least one of the threads might modify it, then you'd better serialize all accesses to that data structure, or you're in for a world of debugging pain.
Typically this serialization is done via a mutex -- i.e. a thread that wants to read or write the data structure locks the mutex, does whatever it needs to do, and then unlocks the mutex to make it available again to other threads.
Which brings me to the point: the memory-heap of a process is a data structure which is accessible by multiple threads. Does this mean that every call to default/non-overloaded new and delete is serialized by a process-global mutex, and is therefore a potential serialization-bottleneck that can slow down multithreaded programs? Or do modern heap implementations avoid or mitigate that problem somehow, and if so, how do they do it?
(Note: I'm tagging this question linux, to avoid the correct-but-uninformative "it's implementation-dependent" response, but I'd also be interested in hearing about how Windows and MacOS/X do it as well, if there are significant differences across implementations)
new and delete are thread-safe
The following functions are required to be thread-safe:
The library versions of operator new and operator delete
User replacement versions of global operator new and operator delete
std::calloc, std::malloc, std::realloc, std::aligned_alloc, std::free
Calls to these functions that allocate or deallocate a particular unit of storage occur in a single total order, and each such deallocation call happens-before the next allocation (if any) in this order.
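A small self-contained demonstration of that guarantee: many threads calling new/delete concurrently is well-defined without any user-level locking (though the calls may still contend inside the allocator):

    #include <thread>
    #include <vector>

    int main() {
        std::vector<std::thread> pool;
        for (int t = 0; t < 8; ++t)
            pool.emplace_back([] {
                for (int i = 0; i < 100000; ++i)
                    delete[] new char[64];  // safe without any mutex of our own
            });
        for (auto& th : pool) th.join();
    }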
With gcc, new is implemented by delegating to malloc, and we see that their malloc does indeed use a lock. If you are worried about allocation causing bottlenecks, write your own allocator.
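As a starting point, a minimal sketch of replacing the global allocation functions; this one just forwards to malloc, but the bodies are where a custom (e.g. per-thread) scheme would plug in:

    #include <cstdlib>
    #include <new>

    // Replacements for the global allocation functions. The default
    // operator new[] forwards to operator new, so arrays are covered too.
    void* operator new(std::size_t size) {
        if (void* p = std::malloc(size ? size : 1))  // malloc(0) may return null
            return p;
        throw std::bad_alloc{};
    }

    void operator delete(void* p) noexcept { std::free(p); }
    void operator delete(void* p, std::size_t) noexcept { std::free(p); }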
The answer is yes, but in practice it is usually not a problem.
If it is a problem for you, you may try replacing your implementation of malloc with tcmalloc, which reduces, but does not eliminate, possible contention (since there is still a single backing heap that must be shared among threads and processes).
TCMalloc assigns each thread a thread-local cache. Small allocations are satisfied from the thread-local cache. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections are used to migrate memory back from a thread-local cache into the central data structures.
There are also other options like using custom allocators and/or specialized containers and/or redesigning your application.
You tried to head off the answer "it is architecture/system dependent", but strict serialization among threads is only required when the heap grows or shrinks, i.e. when the program needs to expand it or return part of it to the system.
The first answer has to be simply "it's implementation dependent, without any system dependencies", because libraries normally obtain large chunks of memory on which to base the heap and administer them internally, which makes the problem effectively independent of the operating system and architecture.
The second answer is that, of course, if you have a single heap for all threads, you have a possible bottleneck when all of the active threads compete for a single chunk of memory. There are several approaches to this. You can keep a pool of heaps to allow parallelism, and have different threads use different pools for their requests; the biggest potential problem is in requesting memory, as that is where the bottleneck lies. On returning memory there is no such issue: you can act more like a garbage collector, queueing the returned chunks for a housekeeping thread to dispatch back into the proper places to preserve the heaps' integrity. Having multiple heaps even lets you classify them by priority, by chunk size, etc., so the risk of collision is lowered by the class of problem you are dealing with. This is the case for operating-system kernels like *BSD, which use several memory heaps, each somewhat dedicated to the kind of use it is going to receive (one for io-disk buffers, one for virtual-memory mapped segments, one for process virtual-memory-space management, etc.).
I recommend reading The Design and Implementation of the FreeBSD Operating System, which explains very well the approach used in the kernel of BSD systems. It is general enough, and probably a great percentage of other systems follow the same or a very similar approach.

TBB tbb::memory_pool<tbb::scalable_allocator<char>>: how to use it correctly?

I have a doubt.
For tbb::memory_pool<tbb::scalable_allocator<char>> shared_memory_pool_;
if that is instantiated in the main thread, and then I call shared_memory_pool_.malloc(sizeof(my_class)) in a worker thread, will TBB allocate that memory from the main heap, or will it allocate from the thread's "domain", so that the lock contention caused by normal malloc() is still avoided?
The tbb::memory_pool is based on the same internals as tbb::scalable_allocator. So, once the memory pool grabs memory initially (in your case, as you specified, from tbb::scalable_allocator as well), it uses the same mechanisms to distribute and reuse it across threads. I.e., it is scalable and avoids the global lock as much as possible. Though, since memory is still a common resource, some thread synchronization is unavoidable anyway. Specifically, I'd expect more contention for initial memory requests, since the per-thread caches are not warm yet. Also, the scalable_allocator tries to keep a balance between scalability and memory consumption, so it will not go crazy with per-thread caches; redistributing memory among threads is itself a kind of thread synchronization (though more scalable than a lock).
As for the [very] initial memory allocation by scalable_allocator, it goes through mmap or VirtualAlloc for big enough memory chunks and not through malloc.
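For reference, a minimal usage sketch of the pattern from the question. It assumes classic TBB, where memory_pool was a preview feature gated behind the TBB_PREVIEW_MEMORY_POOL macro (header paths differ in oneTBB), and my_class stands in for your type:

    #define TBB_PREVIEW_MEMORY_POOL 1    // memory_pool is a preview feature
    #include "tbb/memory_pool.h"
    #include "tbb/scalable_allocator.h"
    #include <new>

    struct my_class { int x = 0; };     // placeholder for your type

    tbb::memory_pool<tbb::scalable_allocator<char>> shared_memory_pool_;

    void worker() {
        // Safe from any thread; the pool serves the request from a
        // per-thread cache whenever it can.
        void* raw = shared_memory_pool_.malloc(sizeof(my_class));
        my_class* obj = new (raw) my_class;  // placement-new into pool memory
        obj->~my_class();
        shared_memory_pool_.free(raw);
    }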
Here is a useful description of how to implement a memory pool correctly. Please note that, according to it:
In our implementation, we tried to provide more general functionality in thread-safe and scalable way. For that purpose, the implementation of the memory pools is based on TBB scalable memory allocator and so has similar speed and memory consumption properties.
Hope this helps.

C++ memory allocation mechanism performance comparison (tcmalloc vs. jemalloc)

I have an application that allocates a lot of memory, and I am considering using a better memory-allocation mechanism than malloc.
My main options are jemalloc and tcmalloc. Are there any benefits to using either of them over the other?
There is a good comparison between some mechanisms (including the author's proprietary mechanism, lockless) at http://locklessinc.com/benchmarks.shtml, and it mentions some pros and cons of each of them.
Given that both mechanisms are active and constantly improving, does anyone have any insight or experience with their relative performance?
If I remember correctly, the main difference was in multi-threaded projects.
Both libraries try to reduce contention for memory acquisition by having threads pick memory from different caches, but they have different strategies:
jemalloc (used by Facebook) maintains a cache per thread
tcmalloc (from Google) maintains a pool of caches, and threads develop a "natural" affinity for a cache, but may change
This led, once again if I remember correctly, to an important difference in terms of thread management.
jemalloc is faster if threads are static, for example using pools
tcmalloc is faster when threads are created/destructed
There is also the problem that, since jemalloc spins up new caches to accommodate new thread IDs, a sudden spike of threads will leave you with (mostly) empty caches in the subsequent calm phase.
As a result, I would recommend tcmalloc in the general case, and reserve jemalloc for very specific usages (low variation on the number of threads during the lifetime of the application).
I have recently considered tcmalloc for a project at work. This is what I observed:
Greatly improved performance for heavy usage of malloc in a multithreaded setting. I used it with a tool at work and the performance improved almost twofold. The reason is that this tool had a few threads performing allocations of small objects in a critical loop. Using glibc, the performance suffered because of, I think, lock contention between malloc/free calls in different threads.
Unfortunately, tcmalloc increases the memory footprint. The tool I mentioned above would consume two or three times more memory (as measured by the maximum resident set size). The increased footprint was a no-go for us, since we were actually looking for ways to reduce the memory footprint.
In the end I have decided not to use tcmalloc and instead optimize the application code directly: this means removing the allocations from the inner loops to avoid the malloc/free lock contentions. (For the curious, using a form of compression rather than using memory pools.)
The lesson for you would be that you should carefully measure your application with typical workloads. If you can afford the additional memory usage, tcmalloc could be great for you. If not, tcmalloc is still useful to see what you would gain by avoiding the frequent calls to memory allocation across threads.
Be aware that, according to the nedmalloc homepage, modern OS allocators are actually pretty fast now:
"Windows 7, Linux 3.x, FreeBSD 8, Mac OS X 10.6 all contain state-of-the-art allocators and no third party allocator is likely to significantly improve on them in real world results"
http://www.nedprod.com/programs/portable/nedmalloc
So you might be able to get away with just recommending that your users upgrade, or something like that :)
You could also consider using the Boehm conservative garbage collector. Basically, you replace every malloc in your source code with GC_malloc (etc.), and you don't bother calling free. Boehm's GC doesn't allocate memory more quickly than malloc (it is about the same, or can be 30% slower), but it has the advantage of dealing with useless memory zones automatically, which might improve your program (and it certainly eases coding, since you no longer care about free). Boehm's GC can also be used as a C++ allocator.
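A minimal usage sketch, assuming libgc and its headers are installed (typically linked with -lgc):

    #include <gc/gc.h>  // Boehm GC

    int main() {
        GC_INIT();
        for (int i = 0; i < 100000; ++i) {
            // Blocks are reclaimed automatically once unreachable;
            // no free()/GC_free() call is needed.
            char* p = static_cast<char*>(GC_MALLOC(64));
            p[0] = 1;
        }
    }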
If you really think that malloc is too slow (but you should benchmark; most mallocs take less than a microsecond), and if you fully understand the allocating behavior of your program, you might replace some mallocs with your own special allocator (which could, for instance, get memory from the kernel in big chunks using mmap and manage it yourself). But I believe doing that is a pain. In C++ you have the allocator concept and std::allocator_traits, with most standard container templates accepting such an allocator (see also std::allocator), e.g. the optional second template argument to std::vector, etc.
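For instance, a minimal C++11-style allocator that standard containers accept. MallocAllocator is a hypothetical name, and it merely forwards to malloc; allocate/deallocate are where an mmap-backed or pooled scheme would go:

    #include <cstdlib>
    #include <new>
    #include <vector>

    template <class T>
    struct MallocAllocator {
        using value_type = T;
        MallocAllocator() = default;
        template <class U> MallocAllocator(const MallocAllocator<U>&) {}

        T* allocate(std::size_t n) {
            if (void* p = std::malloc(n * sizeof(T)))
                return static_cast<T*>(p);
            throw std::bad_alloc{};
        }
        void deallocate(T* p, std::size_t) { std::free(p); }
    };

    template <class T, class U>
    bool operator==(const MallocAllocator<T>&, const MallocAllocator<U>&) { return true; }
    template <class T, class U>
    bool operator!=(const MallocAllocator<T>&, const MallocAllocator<U>&) { return false; }

    std::vector<int, MallocAllocator<int>> v{1, 2, 3};  // container now uses it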
As others suggested, if you believe malloc is a bottleneck, you could allocate data in chunks (or using arenas), or just in an array.
Sometimes, implementing a specialized copying garbage collector (for some of your data) could help. Consider perhaps MPS.
But don't forget that premature optimization is evil; benchmark and profile your application to understand exactly where time is lost.
There's a pretty good discussion about allocators here:
http://www.reddit.com/r/programming/comments/7o8d9/tcmalloca_faster_malloc_than_glibcs_open_sourced/
Your post does not mention threading, but before considering mixing C and C++ allocation methods, I would investigate the concept of a memory pool. Boost has a good one.

Can multithreading speed up memory allocation?

I'm working with an 8 core processor, and am using Boost threads to run a large program.
Logically, the program can be split into groups, where each group is run by a thread.
Inside each group, some classes invoke the 'new' operator a total of 10000 times.
Rational Quantify shows that the 'new' memory allocation is taking up the maximum processing time when the program runs, and is slowing down the entire program.
One way I could speed up the system would be to use threads inside each 'group', so that the 10000 memory allocations can happen in parallel.
I'm unclear on how the memory allocation will be managed here. Will the OS scheduler really be able to allocate memory in parallel?
Standard CRT
While with older versions of Visual Studio the default CRT allocator was blocking, this is no longer true, at least for Visual Studio 2010 and newer, which call the corresponding OS functions directly. The Windows heap manager was blocking until Windows XP; in XP the optional Low Fragmentation Heap is not blocking, while the default one is, and newer OSes (Vista/Win7) use the LFH by default. The performance of recent (Windows 7) allocators is very good, comparable to the scalable replacements listed below (you might still prefer those if targeting older platforms or when you need some other features they provide). Several "scalable allocators" exist, with different licenses and different drawbacks. I think on Linux the default runtime library already uses a scalable allocator (some variant of PTMalloc).
Scalable replacements
I know about:
HOARD (GNU + commercial licenses)
MicroQuill SmartHeap for SMP (commercial license)
Google Perf Tools TCMalloc (BSD license)
NedMalloc (BSD license)
JemAlloc (BSD license)
PTMalloc (GNU, no Windows port yet?)
Intel Thread Building Blocks (GNU, commercial)
You might want to check Scalable memory allocator experiences for my experiences with trying to use some of them in a Windows project.
In practice, most of them work by having a per-thread cache and per-thread pre-allocated regions for allocations, which means that small allocations most often happen entirely within the context of a single thread, and OS services are called only infrequently.
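To make the per-thread-cache idea concrete, here is a hypothetical free-list sketch (not how any particular allocator is implemented): the common path touches only thread-local data, so no lock is taken:

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    // One fixed block size keeps the sketch short; real allocators
    // keep many size classes per thread.
    struct FixedBlockCache {
        static constexpr std::size_t kBlockSize = 64;
        std::vector<void*> free_blocks;

        void* allocate() {
            if (!free_blocks.empty()) {
                void* p = free_blocks.back();  // fast path: thread-local, no lock
                free_blocks.pop_back();
                return p;
            }
            return std::malloc(kBlockSize);    // slow path: shared heap
        }
        void deallocate(void* p) { free_blocks.push_back(p); }
    };

    thread_local FixedBlockCache tls_cache;    // one cache per thread

    void example() {
        void* p = tls_cache.allocate();        // usually no lock taken
        tls_cache.deallocate(p);               // returns to this thread's list
    }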
Dynamic memory allocation uses the heap of the application/module/process (not of the thread). The heap can only handle one allocation request at a time. If you try to allocate memory in "parallel" threads, the requests will be handled in due order by the heap. You will not get behaviour like: one thread is waiting to get its memory while another can ask for some, while a third one is getting some. The threads will have to line up in a queue to get their chunks of memory.
What you need is a pool of heaps. Use whichever heap is not busy at the moment to allocate the memory. But then you have to watch out, throughout the life of that variable, that it does not get deallocated on another heap (which would cause a crash).
I know that the Win32 API has functions such as GetProcessHeap(), HeapCreate(), HeapAlloc() and HeapFree(), which allow you to create a new heap and allocate/deallocate memory from a specific heap HANDLE. I don't know of an equivalent in other operating systems (I have looked, but to no avail).
You should, of course, try to avoid frequent dynamic allocations. But if you can't, you might consider (for portability) creating your own "heap" class (it doesn't have to be a heap per se, just a very efficient allocator) that manages a large chunk of memory, plus a smart-pointer class that holds a reference to the heap it came from. This would enable you to use multiple heaps (make sure they are thread-safe).
There are 2 scalable drop-in replacements for malloc that I know of:
Google's tcmalloc
Facebook's jemalloc (link to a performance study comparing to tcmalloc)
I don't have any experience with Hoard (which performed poorly in the study), but Emery Berger lurks on this site and was astonished by the results. He said he would have a look and I surmise there might have been some specifics to either the test or implementation that "trapped" Hoard as the general feedback is usually good.
One word of caution with jemalloc, it can waste a bit of space when you rapidly create then discard threads (as it creates a new pool for each thread you allocate from). If your threads are stable, there should not be any issue with this.
I believe the short answer to your question is : yes, probably. And as already pointed out by several people here there are ways to achieve this.
Aside from your question and the answers already posted here, it would be good to start with your expectations for improvement, because that will pretty much tell you which path to take. Maybe you need to be 100x faster. Also, do you see yourself needing further speed improvements in the near future, or is there a level that will be good enough? Not knowing your application or problem domain, it's difficult to advise you specifically. Are you, for instance, in a problem domain where speed continuously has to be improved?
One good thing to start with when doing performance improvements is to question whether you need to do things the way you currently do. In this case: can you pre-allocate objects? Is there a maximum number of X objects in the system? Could you re-use objects? All of this is better, because you don't necessarily need to allocate on the critical path. E.g., if you can re-use objects, a custom allocator with pre-allocated objects would work well. Also, what OS are you on?
If you don't have concrete expectations or a certain level of performance, just start experimenting with any of the advices here and you'll find out more.
Good luck!
Roll your own non-multithreaded memory allocator, a distinct copy of which each thread has (you can override new and delete). It allocates in large chunks that it works through, and needs no locking, as each chunk is owned by a single thread. Limit your threads to the number of cores you have available. A sketch of this idea follows.
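A hypothetical sketch of that per-thread chunk ("bump") allocator; the names and sizes are made up, it assumes requests are at most one chunk, and it deliberately never frees chunks to stay short:

    #include <cstddef>
    #include <cstdlib>
    #include <new>

    class ThreadArena {
        static constexpr std::size_t kChunk = 1 << 20;  // 1 MiB per chunk
        char*       base_ = nullptr;
        std::size_t used_ = kChunk;                     // force first refill
    public:
        void* allocate(std::size_t n) {
            n = (n + 15) & ~std::size_t(15);            // keep 16-byte alignment
            if (used_ + n > kChunk) {                   // rare slow path
                base_ = static_cast<char*>(std::malloc(kChunk));
                if (!base_) throw std::bad_alloc{};
                used_ = 0;
            }
            void* p = base_ + used_;
            used_ += n;                                 // no lock: thread-owned
            return p;
        }
        // A real arena would track and free whole chunks; omitted here.
    };

    thread_local ThreadArena arena;  // a distinct copy per thread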
new is pretty much blocking; it has to find the next free bit of memory, which is tricky to do if you have lots of threads all asking for it at once.
Memory allocation is slow. If you are doing it more than a few times, especially on lots of threads, then you need a redesign. Can you pre-allocate enough space at the start? Can you just allocate a big chunk with 'new' and then partition it out yourself?
You need to check your compiler documentation to see whether it makes the allocator thread-safe or not. If it does not, then you will need to overload your new operator and make it thread-safe.
Otherwise it will result in either a segfault or UB.
On some platforms like Windows, access to the global heap is serialized by the OS. Having a thread-separate heap could substantially improve allocation times.
Of course, in this case, it might be worth questioning whether or not you genuinely need heap allocation as opposed to some other form of dynamic allocation.
You may want to take a look at The Hoard Memory Allocator: "is a drop-in replacement for malloc() that can dramatically improve application performance, especially for multithreaded programs running on multiprocessors."
The best you can hope for is ~8 memory allocations in parallel (since you have 8 physical cores), not the 10000 you wrote.
Standard malloc uses a mutex, and the standard STL allocator does the same. Therefore it will not speed up automatically when you introduce threading.
Nevertheless, you can use another malloc library (google for e.g. "ptmalloc") which does not use global locking. If you allocate using the STL (e.g. you allocate strings or vectors), you have to write your own allocator.
Rather interesting article: http://developers.sun.com/solaris/articles/multiproc/multiproc.html
