Performance vs. C++ memory model

With the new shared-memory concurrency features of C++11, it is possible for two threads to allocate memory at the same time. Furthermore, since the compiler does not know in advance whether the compiled code will be run by multiple threads at the same time, it has to assume the worst. My understanding is therefore that the compiled code has to synchronize accesses to the heap in some way, which would slow down single-threaded code that does not need any synchronization.
Wouldn't this contradict the C++ dictum that "you only pay for what you use"? Is the overhead so small that it was not considered important? Are there other areas where the C++ memory model slows down code that is ultimately only used single-threadedly?

Heap managers do indeed need to synchronize, and that is a possible performance problem for multi-threaded code. It is up to the program to mitigate that if necessary. Standard libraries are also reacting and trying to bring in better multi-threaded allocators.
Edit: Some thoughts about the questions in the second paragraph.
Even C++ needs to be sufficiently safe to be usable. "YDPFWYU" is nice, but if it means that you have to wrap a mutex around every allocation if you want to use the code in a multi-threaded environment, you have a big problem. It's like exceptions, really: even code that doesn't actively use them should be somewhat aware that it might be used in a context where they exist, and both the programmer and the compiler need to be aware of that. The compiler needs to create exception support code/data structures, while the programmer needs to write exception-safe code. Multi-threading is the same, only worse: any piece of code you write might be used in a multi-threaded environment, so you need to write thread-safe code, and the compiler/environment needs to be aware of threading (forgo some very unsafe optimizations, and have a thread-safe allocator).
These are the points in C++ where you pay even for what you don't use, as far as the standard is concerned. Your particular compiler might give you an escape hatch (disable exceptions, use single-threaded runtime library), but that's no longer real C++ then.
That said, even (or especially) if you have a single global allocator lock, the overhead for a single-threaded program is minimal: locks are only expensive when under contention. An uncontested mutex lock/unlock is not very significant compared to the rest of the allocator operation.
Under contention, the story is different, which is where custom allocators possibly come in.
As I briefly mentioned above, one other place where C++ is slowed down very slightly by the mere existence of multi-threading is the prohibition of certain optimizations. The compiler cannot invent reads and writes (especially writes) to possibly shared variables (such as globals, or anything you have handed out a pointer to) on code paths that wouldn't ordinarily contain those accesses. This may slow down very specific pieces of code, but over a whole program you are very unlikely to notice.
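For illustration, here is a minimal sketch (not from the original answer) of the kind of transformation the compiler must not perform on a possibly shared variable:

    int shared_flag = 0;  // possibly visible to other threads

    void maybe_set(bool cond) {
        // The compiler may NOT rewrite this as an unconditional store followed
        // by a conditional restore, e.g.:
        //     int old = shared_flag; shared_flag = 1; if (!cond) shared_flag = old;
        // Another thread reading shared_flag concurrently could observe the
        // invented store, introducing a data race the source code does not have.
        if (cond)
            shared_flag = 1;
    }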

You are mixing allocation and access of heap memory.
Multi-threaded heap allocation is indeed synchronized, but at the C library level, at least in all modern (con)current OSes' C libraries. There may be special-purpose C libraries that don't do this. See, for example, the old single- and multi-threaded C runtime libraries for MSVC (note how new versions of MSVS deprecate and even remove the single-threaded variants). I assume glibc has a similar mechanism, and it is probably solely multi-threaded and therefore always synchronized. I haven't heard anyone complain about multi-threaded memory allocation speeds, so if you have a concrete complaint, I'd like to see it properly explained and documented with reproducible code.
Access to heap memory (i.e., after a call to new or malloc has returned) is not protected by any mechanism whatsoever. C++11 gives you mutexes and other synchronization facilities that you, the user, need to apply in your code if you want to protect against race conditions. If you do not, no performance is lost.
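For example (a minimal sketch, not part of the original answer), protecting a heap-allocated container that several threads touch is entirely the user's job:

    #include <mutex>
    #include <vector>

    std::vector<int> data;   // heap-backed container shared between threads
    std::mutex data_mutex;   // user-provided synchronization; new/malloc won't do this for you

    void append(int value) {
        // The allocator calls push_back may make internally are thread-safe,
        // but the vector's own bookkeeping is not, hence the lock.
        std::lock_guard<std::mutex> lock(data_mutex);
        data.push_back(value);
    }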

Compilers are not really forced to stop optimizing. It has always been possible to write very bad compilers and "standard" libraries, and nowadays that is nothing but poor quality, even if it is advertised as the "only truly correct C++".
"any piece of code you write might be used in a multi-threaded environment, so you need to write thread-safe code, and the compiler/environment needs to be aware of threading" - it is a clear stupidness.
A good implementation can always provide a sensible way to optimize single-threaded code (and the necessary libraries), as well as code that does not use exceptions, while still allowing the other features...
(For example, threading requires certain functions to create and coordinate threads; their use is visible at link time and can influence the toolchain. Or the first call to a thread-creation function could switch the memory allocation method (and have other effects). There may be other good approaches too, such as special compiler switches, etc...)

Not really. Both physical memory and backing store are system resources on modern operating systems. So allocations of them and accesses to them have to be properly synchronized.
The case of threads sharing virtual memory is just a special case of the many other ways scheduling entities can share virtual memory. Consider two processes that memory map the same library or data file.
The only extra overhead with threads is modifications to the virtual memory map because threads share a virtual memory map. Much of the synchronization overhead is unavoidable. For example, if you're unmapping something, some resource typically has to be returned to a system-level pool, and that requires synchronization anyway.
On many platforms, some special platform-specific thing is needed to let other threads running at the same time know that their view of virtual memory has changed. But this disappears anyway if there are no other threads since there would be nothing to notify.
It is simply a reality that some features have costs even if they're not being used. The existence of swap logic and checks in your kernel has some costs even if you never swap. Engineers are realists and have to balance costs and benefits.

Both physical memory and backing store are system resources on modern operating systems. So allocations of them and accesses to them have to be properly synchronized.
The case of threads sharing virtual memory is just a special case of the many other ways scheduling entities can share virtual memory.
Since those are features of the operating system, there is no need for extra code in the C/C++ allocation functions of the application program (although, of course, multi-threading does need special additional synchronization in the standard library and additional system calls - see the question at the beginning).
The real trouble may be having multiple variants (single- and multi-threaded) of the same library (the standard C/C++ library and also others) on the system... But...

Related

Is access to the heap serialized?

One rule every programmer quickly learns about multithreading is:
If more than one thread has access to a data structure, and at least one of the threads might modify that data structure, then you'd better serialize all accesses to that data structure, or you're in for a world of debugging pain.
Typically this serialization is done via a mutex -- i.e. a thread that wants to read or write the data structure locks the mutex, does whatever it needs to do, and then unlocks the mutex to make it available again to other threads.
Which brings me to the point: the memory-heap of a process is a data structure which is accessible by multiple threads. Does this mean that every call to default/non-overloaded new and delete is serialized by a process-global mutex, and is therefore a potential serialization-bottleneck that can slow down multithreaded programs? Or do modern heap implementations avoid or mitigate that problem somehow, and if so, how do they do it?
(Note: I'm tagging this question linux, to avoid the correct-but-uninformative "it's implementation-dependent" response, but I'd also be interested in hearing how Windows and MacOS/X do it, if there are significant differences across implementations.)
new and delete are thread-safe
The following functions are required to be thread-safe:
The library versions of operator new and operator delete
User replacement versions of global operator new and operator delete
std::calloc, std::malloc, std::realloc, std::aligned_alloc, std::free
Calls to these functions that allocate or deallocate a particular unit of storage occur in a single total order, and each such deallocation call happens-before the next allocation (if any) in this order.
With gcc, new is implemented by delegating to malloc, and we see that their malloc does indeed use a lock. If you are worried about your allocation causing bottlenecks, write your own allocator.
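As a rough illustration of what such an allocator might do (a sketch only, with hypothetical names, not gcc's implementation), the usual trick is to serve small blocks from a thread_local free list so the common path takes no lock at all, falling back to the locked global allocator only to refill:

    #include <cstddef>
    #include <cstdlib>

    struct Block { Block* next; };

    constexpr std::size_t kBlockSize = 64;       // one size class, for simplicity
    thread_local Block* free_list = nullptr;     // per-thread cache of freed blocks

    void* pool_alloc() {
        if (free_list) {                         // fast path: no synchronization at all
            Block* b = free_list;
            free_list = b->next;
            return b;
        }
        return std::malloc(kBlockSize);          // slow path: the (locked) global allocator
    }

    void pool_free(void* p) {
        Block* b = static_cast<Block*>(p);       // blocks stay cached on this thread's list;
        b->next = free_list;                     // a real allocator would eventually return
        free_list = b;                           // them to the global heap
    }

This is essentially the per-thread-cache idea that tcmalloc (mentioned below) implements in a far more complete form.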
The answer is yes, but in practice it is usually not a problem.
If it is a problem for you, you may try replacing your malloc implementation with tcmalloc, which reduces, but does not eliminate, possible contention (since there is still only one heap that needs to be shared among threads and processes).
TCMalloc assigns each thread a thread-local cache. Small allocations are satisfied from the thread-local cache. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections are used to migrate memory back from a thread-local cache into the central data structures.
There are also other options like using custom allocators and/or specialized containers and/or redesigning your application.
You tried to sidestep the "it's architecture/system dependent" answer, but the problem of multiple threads having to serialize accesses only arises when the heap grows or shrinks, i.e., when the program needs to expand it or return part of it to the system.
So the first answer has to be simply "it's implementation dependent", without any system dependencies, because normally libraries obtain large chunks of memory as the basis of the heap and administer them internally, which makes the problem effectively operating system and architecture independent.
The second answer is that, of course, if you have only one heap for all the threads, you will have a possible bottleneck when all of the active threads compete for a single chunk of memory. There are several approaches to this. You can have a pool of heaps to allow parallelism, and make different threads use different pools for their requests; the largest potential problem is in requesting memory, as that is where the bottleneck appears. On returning memory there is no such issue: you can act more like a garbage collector, queuing the returned chunks and having a dedicated thread put them back in the proper places to preserve the heaps' integrity. Having multiple heaps even allows you to classify them by priority, by chunk size, etc., so the risk of collision is lowered by the class of problem each heap deals with. This is the approach taken by operating system kernels like *BSD, which use several memory heaps, somewhat dedicated to the kind of use they are going to receive (there is one for disk I/O buffers, one for virtual-memory-mapped segments, one for process virtual memory space management, etc.).
I recommend reading The Design and Implementation of the FreeBSD Operating System, which explains very well the approach used in the kernel of BSD systems. It is general enough, and a large share of other systems probably follow this or a very similar approach.
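To make the "pool of heaps" idea concrete, here is a hedged sketch (hypothetical names, with the locked global malloc standing in for each arena's real heap): several arenas with independent locks, and each thread hashed onto one of them so that threads mostly contend on different locks:

    #include <cstddef>
    #include <cstdlib>
    #include <functional>
    #include <mutex>
    #include <thread>

    constexpr int kArenas = 8;

    struct Arena {
        std::mutex lock;
        void* allocate(std::size_t n) {
            std::lock_guard<std::mutex> g(lock);   // contention is per-arena, not global
            return std::malloc(n);                 // stand-in for a real per-arena heap
        }
    };

    Arena arenas[kArenas];

    void* arena_alloc(std::size_t n) {
        std::size_t idx =
            std::hash<std::thread::id>{}(std::this_thread::get_id()) % kArenas;
        return arenas[idx].allocate(n);
    }
    // Freeing is the hard part a real allocator must solve: it has to know
    // which arena (or the global heap) a block came from before handing it back.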

Does C++11 atomic automatically solve multi-core races on variable read-write?

I know that atomic will apply a lock on a variable of type "T" when multiple threads are reading and writing it, making sure only one of them is doing the read or write at a time.
But in a multi-core computer, threads can run on different cores, and different cores have different L1 and L2 caches while sharing the L3 cache. We also know that the C++ compiler will sometimes optimize a variable to be stored in a register, so if a variable is not stored in memory, there is no synchronization of it between the different cores' caches.
So my worry/question is: if an atomic variable is optimized into a register variable by the compiler, then it is not stored in memory, so when one core writes its value, another core could read out a stale value, right? Is there any guarantee of data consistency here?
Thanks.
Atomic doesn't "solve" things the way you vaguely describe. It provides certain very specific guarantees involving consistency of memory based on ordering.
Various compilers implement these guarantees in different ways on different platforms.
On x86/64, no locks are used for atomic integers and pointers up to a reasonable size. And the hardware provides stronger guarantees than the standard requires, making some of the more esoteric options equivalent to full consistency.
I won't be able to fully answer your question but I can point you in the right direction; the topic you need to learn about is "the C++ memory model".
That being said, atomics exist precisely to avoid the problem you describe. If you ask for full memory-order consistency, and thread A modifies X and then Y, no other thread can see Y modified but not X. How that guarantee is provided is not specified by the C++ standard; cache-line invalidation, special instructions for access, barring certain register-based optimizations by the compiler, and so on are all the kinds of things implementations use.
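Sticking with the X-then-Y example, here is a minimal sketch (not from the original answer) of that guarantee using the default sequentially consistent ordering:

    #include <atomic>
    #include <cassert>
    #include <thread>

    int x = 0;                     // plain data
    std::atomic<bool> y{false};    // atomic flag (seq_cst by default)

    void writer() {
        x = 1;                     // modify X...
        y.store(true);             // ...then Y
    }

    void reader() {
        if (y.load())              // if the reader sees Y modified...
            assert(x == 1);        // ...it is guaranteed to see X modified too
    }

    int main() {
        std::thread a(writer), b(reader);
        a.join();
        b.join();
    }

Whether y lives in a register between accesses is the compiler's problem; the atomic operations force the value to become visible to other threads as the standard requires.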
Note that the C++ memory model was refined, bugfixed and polished for C++17 in order to describe the behaviour of the new parallel algorithms and permit their efficient implementation on GPU hardware (among other spots) with the right flags, and in turn it influenced the guarantees that new GPU hardware provides. So people talking about memory models may be excited and talk about more modern issues than your mainly C++11 concerns.
This is a big complex topic. It is really easy to write code you think is portable, yet only works on a specific platform, or only usually works on the platform you tested it on. But that is just because threading is hard.
You may be looking for this:
[intro.progress]/18 An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

Performance cost of threading constructs: missed optimisations and memory allocation

We are experiencing a strange phenomenon in which the inclusion of a header file results in a 5-10% performance penalty in certain memory-allocation-intensive workloads.
This header file declares a thread pool as a global variable. This thread pool is never used in any capacity (yet) in the application. That is, apart from the creation of this static thread pool at program startup, the application is completely single-threaded. The performance penalty disappears once the header is removed.
From a bit of research, it seems that a multithreaded application can incur some performance penalty because certain compiler optimisations are no longer possible. Is it possible that such optimisations are being turned off whenever a threading-related construct is instantiated in any form or capacity?
Or, since the performance penalty seems to be most evident while performing numerous memory allocations, is it possible that during the compilation/linking phase the compiler realises that threading constructs are instantiated and thus it switches to a thread-safe memory allocator?
This happens on a Linux 64 bit workstation, with both GCC and clang. The standard threading primitives from C++11 are being used.
EDIT: I should also probably mention that, according to our tests, when using the tcmalloc allocator instead of the default one, the performance difference seems to go away.
Multi-threaded malloc and some other checked functions incur a locking cost, consistent with what you are seeing. I would expect the malloc implementation to switch to the threaded (and locked) version via some directive in the threading header file.
This is a reasonable cost, and it results in more understandable behaviour from the program, at the price of strange performance changes in single-threaded cases.
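One hedged way to pin this down (a sketch with an assumed iteration count and a hypothetical header name, not the original test) is to time an allocation-heavy single-threaded loop and rebuild it with and without the thread-pool header, or with tcmalloc linked in:

    #include <chrono>
    #include <cstdio>
    #include <vector>
    // #include "thread_pool.h"   // hypothetical header whose mere inclusion is suspected

    int main() {
        constexpr int kCount = 1000000;
        std::vector<int*> ptrs;
        ptrs.reserve(kCount);

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < kCount; ++i)
            ptrs.push_back(new int(i));             // hits the allocator every iteration
        for (int* p : ptrs)
            delete p;
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start).count();

        std::printf("alloc/free of %d ints: %lld ms\n", kCount, static_cast<long long>(ms));
    }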

How does a mutex ensure a variable's value is consistent across cores?

If I have a single int which I want to write to from one thread and read from on another, I need to use std::atomic, to ensure that its value is consistent across cores, regardless of whether or not the instructions that read from and write to it are conceptually atomic. If I don't, it may be that the reading core has an old value in its cache, and will not see the new value. This makes sense to me.
If I have some complex data type that cannot be read/written to atomically, I need to guard access to it using some synchronisation primitive, such as std::mutex. This will prevent the object getting into (or being read from) an inconsistent state. This makes sense to me.
What doesn't make sense to me is how mutexes help with the caching problem that atomics solve. They seem to exist solely to prevent concurrent access to some resource, but not to propagate any values contained within that resource to other cores' caches. Is there some part of their semantics I've missed which deals with this?
The right answer to this is magic pixies - i.e., It Just Works. The implementation of std::atomic for each platform must do the right thing.
The right thing is a combination of 3 parts.
Firstly, the compiler needs to know that it can't move instructions across boundaries [in fact it can in some cases, but assume that it doesn't].
Secondly, the cache/memory subsystem needs to know - this is generally done using memory barriers, although x86/x64 generally have such strong memory guarantees that this isn't necessary in the vast majority of cases (which is a big shame, as it's nice for wrong code to actually go wrong).
Finally, the CPU needs to know it cannot reorder instructions. Modern CPUs are massively aggressive at reordering operations while making sure that, in the single-threaded case, this is unnoticeable. They may need extra hints that this must not happen in certain places.
For most CPUs part 2 and 3 come down to the same thing - a memory barrier implies both. Part 1 is totally inside the compiler, and is down to the compiler writers to get right.
See Herb Sutter's talk 'Atomic Weapons' for a lot more interesting info.
Consistency across cores is ensured by memory barriers (which also prevent instruction reordering). When you use std::atomic, not only do you access the data atomically, but the compiler (and library) also insert the relevant memory barriers.
Mutexes work the same way: the mutex implementations (e.g. pthreads or the WinAPI) internally also insert memory barriers.
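A small sketch (not from the original answer) of what that means in practice: a mutex-protected plain int behaves consistently across cores because the lock/unlock pair carries those barriers:

    #include <mutex>
    #include <thread>

    int counter = 0;        // plain int, no std::atomic
    std::mutex m;

    void work() {
        for (int i = 0; i < 100000; ++i) {
            std::lock_guard<std::mutex> lock(m);
            ++counter;      // lock() acquires and unlock() releases, so each thread
                            // sees the latest value while it holds the mutex
        }
    }

    int main() {
        std::thread a(work), b(work);
        a.join();
        b.join();
        // counter == 200000: the mutex provides visibility across cores,
        // not just mutual exclusion
    }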
Most modern multicore processors (including x86 and x64) are cache coherent. If two cores hold the same memory location in cache and one of them updates the value, the change is automatically propagated to other cores' caches. It's inefficient (writing to the same cache line at the same time from two cores is really slow) but without cache coherence it would be very difficult to write multithreaded software.
And like syam said, memory barriers are also required. They prevent the compiler or processor from reordering memory accesses, and they also force writes out to memory (or at least to cache) when, for example, a variable is held in a register because of compiler optimizations.

Does a pthread_mutex lock provide higher performance than a user-imposed memory barrier in code?

Problem Background
The code in question is part of a C++ implementation. We have a code base where, for certain critical pieces, we use asm volatile ("mfence" ::: "memory").
My understanding of memory barriers is -
It is used to ensure complete/ordered execution of the instruction set.
It will help avoidance of classical thread synchronization problem - Wiki link.
Question
Is pthread_mutex faster than the memory barrier, in the case where we use a memory fence to avoid the thread synchronization problem? I have read content indicating that pthread mutexes use memory synchronization.
PS:
In our code, the asm volatile ("mfence" ::: "memory") is used after 10-15 lines of C++ code (in a member function). So my doubt is: maybe a mutex implementation of the memory synchronization gives better performance than a memory barrier in user-written code (with respect to the scope of the barrier).
We are using SUSE Linux 10, kernel 2.6.16.46, SMP, x86_64, with a quad-core processor.
pthread mutexes are guaranteed to be slower than a memory fence instruction (I can't say how much slower; that is entirely platform dependent). The reason is simply that, in order to be compliant POSIX mutexes, they must include memory guarantees. POSIX mutexes have strong memory guarantees, and thus I can't see how they could be implemented without such fences*.
If you're looking for practical advice I use fences in many places instead of mutexes and have timed both of them frequently. pthread_mutexes are very slow on Linux compared to just a raw memory fence (of course, they do a lot more, so be careful what you are actually comparing).
Note however that certain atomic operations, in particular those in C++11, can be, and certainly will be, faster than using fences all over the place. In this case the compiler/library understands the architecture and need not use a full fence in order to provide the memory guarantees.
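For instance (a sketch assuming an x86-64 target, not part of the original answer), a release/acquire pair typically compiles to plain MOV instructions on x86-64, with no MFENCE at all:

    #include <atomic>

    std::atomic<int> ready{0};
    int payload = 0;

    void publish(int value) {
        payload = value;
        // On x86-64 this release store is ordinarily a plain MOV; the hardware
        // already forbids the reorderings that release ordering rules out, so a
        // hand-written full fence here would be stronger (and slower) than needed.
        ready.store(1, std::memory_order_release);
    }

    int consume() {
        if (ready.load(std::memory_order_acquire))  // also a plain MOV on x86-64
            return payload;                          // guaranteed to see the published value
        return -1;
    }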
Also note, I'm talking about very low-level performance of the lock itself. You need to be profiling to the nanosecond level.
*It is possible to imagine a mutex system that ignores certain types of memory and chooses a more lenient locking implementation (such as relying on the ordering guarantees of normal memory and ignoring specially marked memory). I would argue, however, that such an implementation is not valid.