Performance cost of threading constructs: missed optimisations and memory allocation - c++

We are experiencing a strange phenomenon in which the inclusion of a header file results in a 5-10% performance penalty in certain memory-allocation-intensive workloads.
This header file declares a thread pool as a global variable. This thread pool is never used in any capacity (yet) in the application. That is, apart from the creation of this static thread pool at program startup, the application is completely single-threaded. The performance penalty disappears once the header is removed.
From a bit of research, it seems that a multithreaded application can incur some performance penalty because certain compiler optimisations are no longer possible. Is it possible that such optimisations are being turned off whenever a threading-related construct is instantiated in any form or capacity?
Or, since the performance penalty seems to be most evident while performing numerous memory allocations, is it possible that during the compilation/linking phase the compiler realises that threading constructs are instantiated and thus it switches to a thread-safe memory allocator?
This happens on a Linux 64 bit workstation, with both GCC and clang. The standard threading primitives from C++11 are being used.
EDIT I should also probably mention that, according to our tests, when using the tcmalloc allocator instead of the default one, the performance difference seems to go away.

Multi-threaded malloc and some other checked functions incur a lock cost, consistent with what you are seeing. I would expect the malloc implementation to change to the threaded (and locked) version by some directive in the thread header file.
This is a reasonable cost, and results in a more understandable output from the program, at the cost of strange performance changes for single threaded examples.
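One way to check this explanation is to time an allocation-heavy loop in two builds, one with the thread-pool header and one without, and again with tcmalloc swapped in. The following is only a minimal sketch: the helper name time_malloc_free_ns is hypothetical, and the volatile pointer is an assumption about how to discourage the compiler from removing the malloc/free pairs.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: times `count` malloc/free pairs of `size` bytes and
// returns the elapsed wall-clock time in nanoseconds.
long long time_malloc_free_ns(std::size_t count, std::size_t size) {
    assert(size > 0);
    void* volatile escape = nullptr;  // storing the pointer here makes it "escape"
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < count; ++i) {
        void* p = std::malloc(size);
        if (p == nullptr) break;      // out of memory: stop timing early
        escape = p;                   // discourages eliding the allocation
        std::free(p);
    }
    const auto stop = std::chrono::steady_clock::now();
    (void)escape;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Calling time_malloc_free_ns(10000000, 64) in both builds and comparing the numbers over several runs should show whether the allocator path itself accounts for the 5-10% difference.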

Related

Allocating memory will block whole of threads? [duplicate]

I'm curious as to whether there is a lock on memory allocation if two threads simultaneously request to allocate memory. I am using OpenMP to do multithreading, C++ code.
OS's: mostly linux, but would like to know for Windows and Mac as well.
There could be improvements in certain implementations, such as creating a thread-specific cache (in this case allocations of small blocks will be lock-free). For instance, this from Google. But in general, yes, there is a lock on memory allocations.
By default Windows locks the heap when you use the Win API heap functions.
You can control the locking at least at the time of heap creation. Different compilers and C runtimes do different things with the malloc/free family. For example, the SmartHeap API at one point created one heap per thread and therefore needed no locking. There were also config options to turn that behavior on and off.
At one point in the early/mid '90s the Borland Windows and OS/2 compilers explicitly turned off Heap locking (a premature optimization bug) until multiple threads were launched with beginthread. Many many people tried to spawn threads with an OS API call and then were surprised when the heap corrupted itself all to hell...
http://en.wikipedia.org/wiki/Malloc
Modern malloc implementations try to be as lock-free as possible by keeping separate "arenas" for each thread.
Free store is a shared resource and must be synchronized. Allocation/deallocation is costly. If you are multithreading for performance, then frequent allocation/deallocation can become a bottleneck. As a general rule, avoid allocation/deallocation inside tight loops. Another problem is false sharing.
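The advice about tight loops can be illustrated with a small sketch. Both functions below are hypothetical examples that count upper-case letters; the first allocates a scratch buffer on every iteration, the second reuses one buffer so that the free store is only touched when the capacity has to grow.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Allocates a fresh buffer per word: one trip to the free store per iteration.
std::size_t total_upper_naive(const std::vector<std::string>& words) {
    std::size_t total = 0;
    for (const std::string& w : words) {
        std::vector<char> scratch(w.begin(), w.end());  // allocation inside the loop
        for (char c : scratch) total += (c >= 'A' && c <= 'Z') ? 1 : 0;
    }
    return total;
}

// Reuses one buffer: assign() only reallocates when the capacity is too small.
std::size_t total_upper_reuse(const std::vector<std::string>& words) {
    std::size_t total = 0;
    std::vector<char> scratch;  // hoisted out of the loop
    for (const std::string& w : words) {
        scratch.assign(w.begin(), w.end());
        assert(scratch.size() == w.size());
        for (char c : scratch) total += (c >= 'A' && c <= 'Z') ? 1 : 0;
    }
    return total;
}
```

Both versions return the same result; the difference is only in how often the shared allocator is entered, which matters most when several threads run such loops at once.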

Performance vs. C++ memory model

With the new shared-memory concurrency features of C++11, two threads can allocate memory at the same time. Furthermore, since the compiler does not know in advance whether the compiled code will be run by multiple threads at the same time, it has to assume the worst. Thus, my understanding is that the compiled code has to synchronize trips to the heap in some way. This would then slow down single-threaded code, which does not need synchronization.
Wouldn't this contradict the C++ dictum that "you only pay for what you use"? Is the overhead so small that it was not considered important? Are there other areas where the C++ memory model slows down code that in the end is only run single-threaded?
Heap managers indeed need to synchronize, and that's a possible performance problem for multi-threaded code. It's up to the program to mitigate that if necessary. Standard libraries are also reacting, trying to get in better multi-threaded allocators.
Edit: Some thoughts about the questions in the second paragraph.
Even C++ needs to be sufficiently safe to be usable. "YDPFWYU" is nice, but if it means that you have to wrap a mutex around every allocation if you want to use the code in a multi-threaded environment, you have a big problem. It's like exceptions, really: even code that doesn't actively use them should be somewhat aware that it might be used in a context where they exist, and both the programmer and the compiler need to be aware of that. The compiler needs to create exception support code/data structures, while the programmer needs to write exception-safe code. Multi-threading is the same, only worse: any piece of code you write might be used in a multi-threaded environment, so you need to write thread-safe code, and the compiler/environment needs to be aware of threading (forgo some very unsafe optimizations, and have a thread-safe allocator).
These are the points in C++ where you pay even for what you don't use, as far as the standard is concerned. Your particular compiler might give you an escape hatch (disable exceptions, use single-threaded runtime library), but that's no longer real C++ then.
That said, even (or especially) if you have a single global allocator lock, the overhead for a single-threaded program is minimal: locks are only expensive when under contention. An uncontested mutex lock/unlock is not very significant compared to the rest of the allocator operation.
Under contention, the story is different, which is where custom allocators possibly come in.
As I briefly mentioned above, one other place in C++ is slowed down very slightly by the mere existence of multi-threading: the prohibition on some particular optimizations. The compiler cannot invent reads and writes (especially writes) to possibly shared variables (like globals or things you handed out a pointer to) in code paths that wouldn't ordinarily have these accesses. This may slow down very specific pieces of code, but overall in a program, it's very unlikely that you'll notice.
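A sketch of the kind of transformation meant here, under the assumption that g_hits is a global other threads might touch (all names are hypothetical). The first function is the code as written. The second shows what a register-promoting optimizer could produce: it stores to g_hits even when no element is positive, which is an invented write and could introduce a data race that the source never had.

```cpp
#include <cassert>
#include <cstddef>

int g_hits = 0;  // hypothetical global that another thread might also access

// As written: g_hits is written only when a positive element is found.
void count_positive(const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] > 0) ++g_hits;
    }
}

// Hand-written illustration of the forbidden rewrite: the counter lives in a
// local and is stored back unconditionally at the end.
void count_positive_register_promoted(const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    int tmp = g_hits;
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] > 0) ++tmp;
    }
    g_hits = tmp;  // invented write: happens even if nothing was counted
}
```

In a single-threaded run the two are indistinguishable, which is exactly why the restriction only costs something in very specific loops.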
You are mixing allocation and access of heap memory.
Multi-threaded heap allocation is indeed synchronized, but at a C library level, at least in all modern (con)current OSes' C libraries. There may be specific-purpose C libraries that don't do this. See for example the old single- and multithreaded C runtime library for MSVC (note how new versions of MSVS deprecate and even remove single-threaded variants). I assume glibc has a similar mechanism, and is probably also solely multithreaded, and so always synchronized. I haven't heard anyone complain about multithreaded memory allocation speeds, so if you have a concrete complaint, I'd like to see it properly explained and documented with reproducible code.
Access to heap memory (i.e. after a call to new or malloc has returned) is not protected by any mechanism whatsoever. C++11 gives you mutex and other synchronization facilities that you, as a user, need to use in your code if you want to protect against race conditions. If you do not, no performance is lost.
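To make the distinction concrete, here is a minimal sketch in which the allocation done by push_back is synchronized by the C library, while access to the shared vector is protected only because the code takes a std::mutex itself. The function name fill_concurrently and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical example: each of `num_threads` workers appends `per_thread`
// values to one shared vector; returns the final element count.
std::size_t fill_concurrently(unsigned num_threads, std::size_t per_thread) {
    std::vector<int> shared;  // heap-backed data shared by all workers
    std::mutex guard;         // protection supplied by the user, nothing is implicit
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&shared, &guard, per_thread, t] {
            for (std::size_t i = 0; i < per_thread; ++i) {
                std::lock_guard<std::mutex> lock(guard);  // serializes push_back
                shared.push_back(static_cast<int>(t));
            }
        });
    }
    for (std::thread& w : workers) w.join();
    assert(shared.size() == static_cast<std::size_t>(num_threads) * per_thread);
    return shared.size();
}
```

Removing the lock_guard line would leave the allocator calls safe but make the concurrent push_back calls a data race on the vector itself.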
Compilers are not really forced to give up optimizing. It has always been possible to write very bad compilers and "standard" libraries, and nowadays that is nothing but poor quality, even though it may be advertised as the "only real right C++".
"any piece of code you write might be used in a multi-threaded environment, so you need to write thread-safe code, and the compiler/environment needs to be aware of threading" - this is clearly nonsense.
A good implementation can always provide a normal way to optimize single-threaded code (and the necessary libraries), as well as code that does not use exceptions, while still allowing the other features.
(For example, threading requires certain functions to coordinate and create threads, and at link time their use is visible and may affect the toolchain. Or the first call to a thread creation function may change the memory allocation method (and have other effects). There may be other good ways too, such as special compiler switches.)
Not really. Both physical memory and backing store are system resources on modern operating systems. So allocations of them and accesses to them have to be properly synchronized.
The case of threads sharing virtual memory is just a special case of the many other ways scheduling entities can share virtual memory. Consider two processes that memory map the same library or data file.
The only extra overhead with threads is modifications to the virtual memory map because threads share a virtual memory map. Much of the synchronization overhead is unavoidable. For example, if you're unmapping something, some resource typically has to be returned to a system-level pool, and that requires synchronization anyway.
On many platforms, some special platform-specific thing is needed to let other threads running at the same time know that their view of virtual memory has changed. But this disappears anyway if there are no other threads since there would be nothing to notify.
It is simply a reality that some features have costs even if they're not being used. The existence of swap logic and checks in your kernel has some costs even if you never swap. Engineers are realists and have to balance costs and benefits.
Both physical memory and backing store are system resources on modern operating systems. So allocations of them and accesses to them have to be properly synchronized.
The case of threads sharing virtual memory is just a special case of the many other ways scheduling entities can share virtual memory.
Since these are features of the operating system, there is no need for additional code in the C/C++ allocation functions of an application program (although, of course, multithreading really does need additional synchronization in the standard library and additional "system calls"; see the question at the beginning).
The real trouble may be having several variants (single- and multi-threaded) of the same library (the standard C/C++ library and others) in the system... But...

Does pthread_mutex lock provide higher performance than user imposed memory barrier in code

Problem Background
The code in question is a C++ implementation. In our code base, certain critical parts use asm volatile ("mfence":::"memory").
My understanding of memory barriers is -
They are used to ensure complete/ordered execution of the instruction set.
They help avoid the classical thread synchronization problem - Wiki link.
Question
Is pthread_mutex faster than a memory barrier when the memory fence is used to avoid thread synchronization problems? I have read material indicating that pthread mutexes use memory synchronization.
PS :
In our code, asm volatile ("mfence":::"memory") is used after 10-15 lines of C++ code (in a member function). So my question is whether a mutex implementation of the memory synchronization might give better performance than an MB in user-implemented code (w.r.t. the scope of the MB).
We are using SUSE Linux 10, 2.6.16.46, smp#1, x86_64 with a quad-core processor.
pthread mutexes are guaranteed to be slower than a memory fence instruction (I can't say how much slower; that is entirely platform dependent). The reason is simply that, in order to be POSIX-compliant, mutexes must include memory guarantees. POSIX mutexes have strong memory guarantees, and thus I can't see how they would be implemented without such fences*.
If you're looking for practical advice I use fences in many places instead of mutexes and have timed both of them frequently. pthread_mutexes are very slow on Linux compared to just a raw memory fence (of course, they do a lot more, so be careful what you are actually comparing).
Note however that certain atomic operations, in particular those in C++11, could, and certainly will, be faster than using fences all over. In this case the compiler/library understands the architecture and need not use the full fence in order to provide the memory guarantees.
Also note, I'm talking about very low-level performance of the lock itself. You need to be profiling to the nanosecond level.
*It is possible to imagine a mutex system which ignores certain types of memory and chooses a more lenient locking implementation (such as relying on ordering guarantees of normal memory and ignoring specially marked memory). I would argue such an implementation is however not valid.
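The point about C++11 atomics can be sketched as follows. The Slot type and the function names are hypothetical. The first publisher uses a full sequentially consistent fence, which on x86-64 compilers typically emit as mfence or a locked instruction. The second uses a release store, which on x86-64 is typically an ordinary store. Both give a reader that uses an acquire load the same guarantee about payload.

```cpp
#include <atomic>
#include <cassert>

struct Slot {
    int payload = 0;                 // plain data, published via `ready`
    std::atomic<bool> ready{false};
};

// Fence-everywhere style: full barrier between the data write and the flag.
void publish_with_full_fence(Slot& s, int value) {
    assert(!s.ready.load(std::memory_order_relaxed));  // publish at most once
    s.payload = value;
    std::atomic_thread_fence(std::memory_order_seq_cst);
    s.ready.store(true, std::memory_order_relaxed);
}

// C++11 style: the release store alone orders the payload write before it.
void publish_with_release(Slot& s, int value) {
    assert(!s.ready.load(std::memory_order_relaxed));  // publish at most once
    s.payload = value;
    s.ready.store(true, std::memory_order_release);
}

// Reader side: the acquire load pairs with either publisher.
bool try_read(const Slot& s, int& out) {
    if (!s.ready.load(std::memory_order_acquire)) return false;
    out = s.payload;
    return true;
}
```

How large the difference is depends on the platform, so both variants should be timed on the target hardware before drawing conclusions.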

C++ memory allocation mechanism performance comparison (tcmalloc vs. jemalloc)

I have an application which allocates lots of memory and I am considering using a better memory allocation mechanism than malloc.
My main options are: jemalloc and tcmalloc. Is there any benefits in using any of them over the other?
There is a good comparison between some mechanisms (including the author's proprietary mechanism -- lockless) in http://locklessinc.com/benchmarks.shtml
and it mentions some pros and cons of each of them.
Given that both of the mechanisms are active and constantly improving. Does anyone have any insight or experience about the relative performance of these two?
If I remember correctly, the main difference was with multi-threaded projects.
Both libraries try to reduce contention on memory acquisition by having threads pick memory from different caches, but they have different strategies:
jemalloc (used by Facebook) maintains a cache per thread
tcmalloc (from Google) maintains a pool of caches, and threads develop a "natural" affinity for a cache, but may change
This led, once again if I remember correctly, to an important difference in terms of thread management.
jemalloc is faster if threads are static, for example using pools
tcmalloc is faster when threads are created/destroyed
There is also the problem that, since jemalloc spins up new caches to accommodate new thread ids, a sudden spike of threads will leave you with (mostly) empty caches in the subsequent calm phase.
As a result, I would recommend tcmalloc in the general case, and reserve jemalloc for very specific usages (low variation on the number of threads during the lifetime of the application).
I have recently considered tcmalloc for a project at work. This is what I observed:
Greatly improved performance for heavy usage of malloc in a multithreaded setting. I used it with a tool at work and the performance improved almost twofold. The reason is that in this tool there were a few threads performing allocations of small objects in a critical loop. Using glibc, the performance suffers because of, I think, lock contentions between malloc/free calls in different threads.
Unfortunately, tcmalloc increases the memory footprint. The tool I mentioned above would consume two or three times more memory (as measured by the maximum resident set size). The increased footprint is a no go for us since we are actually looking for ways to reduce memory footprint.
In the end I have decided not to use tcmalloc and instead optimize the application code directly: this means removing the allocations from the inner loops to avoid the malloc/free lock contentions. (For the curious, using a form of compression rather than using memory pools.)
The lesson for you would be that you should carefully measure your application with typical workloads. If you can afford the additional memory usage, tcmalloc could be great for you. If not, tcmalloc is still useful to see what you would gain by avoiding the frequent calls to memory allocation across threads.
Be aware that according to the 'nedmalloc' homepage, modern OS's allocators are actually pretty fast now:
"Windows 7, Linux 3.x, FreeBSD 8, Mac OS X 10.6 all contain state-of-the-art allocators and no third party allocator is likely to significantly improve on them in real world results"
http://www.nedprod.com/programs/portable/nedmalloc
So you might be able to get away with just recommending your users upgrade or something like it :)
You could also consider using Boehm's conservative garbage collector. Basically, you replace every malloc in your source code with GC_malloc (etc...), and you don't bother calling free. Boehm's GC doesn't allocate memory more quickly than malloc (it is about the same, or can be 30% slower), but it has the advantage of dealing with useless memory zones automatically, which might improve your program (and certainly eases coding, since you don't care any more about free). And Boehm's GC can also be used as a C++ allocator.
If you really think that malloc is too slow (but you should benchmark; most malloc-s take less than a microsecond), and if you fully understand the allocating behavior of your program, you might replace some malloc-s with your special allocator (which could, for instance, get memory from the kernel in big chunks using mmap and manage memory by yourself). But I believe doing that is a pain. In C++ you have the allocator concept and std::allocator_traits, with most standard container templates accepting such an allocator (see also std::allocator), e.g. the optional second template argument to std::vector, etc...
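As an illustration of the allocator concept, here is a minimal sketch of an allocator passed as the second template argument of std::vector. CountingAllocator, allocation_count and allocations_for_reserve are hypothetical names; the allocator simply forwards to operator new and counts calls, and the counter is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Shared counter so that rebound copies of the allocator report to one place.
inline std::size_t& allocation_count() {
    static std::size_t count = 0;
    return count;
}

template <class T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() noexcept = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        assert(n <= static_cast<std::size_t>(-1) / sizeof(T));  // no size overflow
        ++allocation_count();
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }
};

// Stateless allocators always compare equal.
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return false; }

// Reports how many allocator calls a single reserve() triggers.
std::size_t allocations_for_reserve(std::size_t n) {
    const std::size_t before = allocation_count();
    std::vector<int, CountingAllocator<int>> v;
    v.reserve(n);
    return allocation_count() - before;
}
```

std::allocator_traits fills in the members the sketch omits (construct, destroy, rebind and so on), which is why so little code is needed. Counting calls like this is also a cheap way to find out which containers allocate most often before replacing anything.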
As others suggested, if you believe malloc is a bottleneck, you could allocate data in chunks (or using arenas), or just in an array.
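A minimal sketch of the chunk/arena idea, assuming a fixed-capacity arena that is released all at once when it goes out of scope. The Arena class is hypothetical and deliberately has no per-object free, and it is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

class Arena {
public:
    explicit Arena(std::size_t capacity)
        : storage_(new unsigned char[capacity]),
          cursor_(storage_.get()),
          remaining_(capacity) {}

    // Carves `size` bytes with the given power-of-two alignment out of the
    // chunk; returns nullptr when the arena is exhausted.
    void* allocate(std::size_t size, std::size_t alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        void* p = cursor_;
        if (std::align(alignment, size, p, remaining_) == nullptr) return nullptr;
        cursor_ = static_cast<unsigned char*>(p) + size;  // bump past the block
        remaining_ -= size;  // std::align already subtracted the padding
        return p;
    }

    std::size_t remaining() const { return remaining_; }

private:
    std::unique_ptr<unsigned char[]> storage_;  // the single big chunk
    unsigned char* cursor_;                     // next free byte
    std::size_t remaining_;                     // bytes left after cursor_
};
```

Each allocate call is a few arithmetic operations and never enters malloc, so it takes no allocator lock. The trade-off is that memory is only returned when the whole arena is destroyed.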
Sometimes, implementing a specialized copying garbage collector (for some of your data) could help. Consider perhaps MPS.
But don't forget that premature optimization is evil and please benchmark & profile your application to understand exactly where time is lost.
There's a pretty good discussion about allocators here:
http://www.reddit.com/r/programming/comments/7o8d9/tcmalloca_faster_malloc_than_glibcs_open_sourced/
Your post does not mention threading, but before considering mixing C and C++ allocation methods, I would investigate the concept of a memory pool. Boost has a good one.

Overhead of a Memory Barrier / Fence

I'm currently writing C++ code and use a lot of memory barriers / fences in it. I know that an MB tells the compiler and the hardware not to reorder writes/reads around it. But I don't know how complex this operation is for the processor at runtime.
My question is: what is the runtime overhead of such a barrier? I didn't find any useful answer with Google...
Is the overhead negligible? Or does heavy usage of MBs lead to serious performance problems?
Best regards.
Compared to arithmetic and "normal" instructions I understand these to be very costly, but do not have numbers to back up that statement. I like jalf's answer by describing effects of the instructions, and would like to add a bit.
There are in general a few different types of barriers, so understanding the differences could be helpful. A barrier like the one that jalf mentioned is required for example in a mutex implementation before clearing the lock word (lwsync on ppc, or st4.rel on ia64 for example). All reads and writes must be complete, and only instructions later in the pipeline that have no memory access and no dependencies on in progress memory operations can be executed.
Another type of barrier is the sort that you'd use in a mutex implementation when acquiring a lock (examples, isync on ppc, or instr.acq on ia64). This has an effect on future instructions, so if a non-dependent load has been prefetched it must be discarded. Example:
if ( pSharedMem->atomic.bit_is_set() ) // use a bit to flag that somethingElse is "ready"
{
    foo( pSharedMem->somethingElse ) ;
}
Without an acquire barrier (borrowing ia64 lingo), your program may have unexpected results if somethingElse made it into a register before the check of the flagging bit is complete.
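In C++11 terms, the same pattern could be written roughly as below. This is a sketch only: SharedMem, kReadyBit, make_ready and read_if_ready are hypothetical stand-ins for the pSharedMem example, with the acquire barrier expressed as a memory_order_acquire load paired with a release operation on the writer side.

```cpp
#include <atomic>
#include <cassert>

struct SharedMem {
    std::atomic<unsigned> flags{0};  // bit 0 flags that somethingElse is "ready"
    int somethingElse = 0;
};

constexpr unsigned kReadyBit = 1u;  // hypothetical bit assignment

// Writer: fill in the data first, then set the bit with release semantics.
void make_ready(SharedMem& m, int value) {
    assert((m.flags.load(std::memory_order_relaxed) & kReadyBit) == 0);
    m.somethingElse = value;
    m.flags.fetch_or(kReadyBit, std::memory_order_release);
}

// Reader: the acquire load keeps the read of somethingElse from being
// performed before the check of the bit.
bool read_if_ready(const SharedMem& m, int& out) {
    if (m.flags.load(std::memory_order_acquire) & kReadyBit) {
        out = m.somethingElse;
        return true;
    }
    return false;
}
```

With a relaxed load instead of the acquire load, the read of somethingElse would be a data race, and it could observe a stale value on weakly ordered hardware.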
There is a third type of barrier, generally less used, that is required to enforce store-load ordering. Examples of instructions that enforce such ordering are sync on ppc (heavyweight sync), MF on ia64, and membar #storeload on sparc (required even for TSO).
Using ia64-like pseudocode to illustrate, suppose one had
st4.rel
ld4.acq
Without an mf in between, one has no guarantee that the load follows the store. You know that loads and stores preceding the st4.rel are done before that store or the "subsequent" load, but that load or other future loads (and perhaps stores if non-dependent?) could sneak in, completing earlier since nothing prevents that otherwise.
Because mutex implementations very likely only use acquire and release barriers in their implementations, I'd expect that an observable effect of this is that memory access following lock release may actually sometimes occur while "still in the critical section".
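A minimal sketch of a lock built from exactly these two barriers, using C++11 atomics. SpinLock is a hypothetical teaching example, not how pthread mutexes are actually implemented: lock() is an acquire operation and unlock() is a release operation, so nothing stops accesses that come after unlock() in program order from being performed before the release store.

```cpp
#include <atomic>
#include <cassert>

class SpinLock {
public:
    // Acquire: accesses after lock() cannot move before the successful exchange.
    void lock() {
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // spin until the previous holder releases
        }
    }

    bool try_lock() { return !locked_.exchange(true, std::memory_order_acquire); }

    // Release: accesses before unlock() cannot move after the store, but
    // later accesses are still free to move up into the critical section.
    void unlock() {
        assert(locked_.load(std::memory_order_relaxed));  // must be held
        locked_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> locked_{false};
};
```

The class satisfies the usual lock()/unlock() interface, so it can be used with std::lock_guard<SpinLock> in the same way as std::mutex.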
Try thinking about what the instruction does. It doesn't make the CPU do anything complicated in terms of logic, but it forces it to wait until all reads and writes have been committed to main memory. So the cost really depends on the cost of accessing main memory (and the number of outstanding reads/writes).
Accessing main memory is generally pretty expensive (10-200 clock cycles), but in a sense, that work would have to be done without the barrier as well, it could just be hidden by executing some other instructions simultaneously so you didn't feel the cost so much.
It also limits the CPU's (and compiler's) ability to reschedule instructions, so there may be an indirect cost as well, in that nearby instructions can't be interleaved, which might otherwise yield a more efficient execution schedule.