What is a memory fence? - concurrency

What is meant by using an explicit memory fence?

For performance gains, modern CPUs often execute instructions out of order to make maximum use of the available silicon (including memory reads/writes). Because the hardware enforces instruction integrity you never notice this in a single thread of execution. However, for multiple threads or environments with volatile memory (memory-mapped I/O, for example) this can lead to unpredictable behavior.
A memory fence/barrier is a class of instructions that ensure memory reads/writes occur in the order you expect. For example, a 'full fence' means all reads/writes before the fence are committed before those after the fence.
Note that memory fences are a hardware concept. In higher-level languages we are used to dealing with mutexes and semaphores; these may well be implemented using memory fences at the low level, and explicit use of memory barriers is then not necessary. Use of memory barriers requires a careful study of the hardware architecture and is more commonly found in device drivers than application code.
The CPU reordering is different from compiler optimisations, although the artefacts can be similar. You need to take separate measures to stop the compiler reordering your instructions if that may cause undesirable behaviour (e.g. use of the volatile keyword in C).
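The compiler-vs-CPU distinction can be sketched in C++ (a hedged illustration; `std::atomic_signal_fence` is a compiler-only barrier, while `std::atomic_thread_fence` additionally emits a hardware fence where the architecture needs one):

```cpp
#include <atomic>

int data = 0;
std::atomic<bool> ready{false};

void publish() {
    data = 42;
    // Compiler-only barrier: stops the compiler moving the store to `data`
    // below this point, but emits no CPU fence instruction.
    std::atomic_signal_fence(std::memory_order_release);

    // Compiler AND hardware barrier: also orders the stores as observed by
    // other cores (free on x86; a real instruction on ARM/POWER).
    std::atomic_thread_fence(std::memory_order_release);

    ready.store(true, std::memory_order_relaxed);
}
```

Note that `volatile` in C would only prevent the compiler from eliminating or reordering the accesses; it would not add any hardware ordering.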

Copying my answer to another question, What are some tricks that a processor does to optimize code?:
The most important one would be memory access reordering.
Absent memory fences or serializing instructions, the processor is free to reorder memory accesses. Some processor architectures have restrictions on how much they can reorder; Alpha is known for being the weakest (i.e., the one which can reorder the most).
A very good treatment of the subject can be found in the Linux kernel source documentation, at Documentation/memory-barriers.txt.
Most of the time, it's best to use locking primitives from your compiler or standard library; these are well tested, should have all the necessary memory barriers in place, and are probably quite optimized (optimizing locking primitives is tricky; even the experts can get them wrong sometimes).

In my experience it refers to a memory barrier, which is an instruction (explicit or implicit) to synchronize memory access between multiple threads.
The problem occurs in the combination of modern aggressive compilers (which have amazing freedom to reorder instructions, but usually know nothing of your threads) and modern multicore CPUs.
A good introduction to the problem is "The 'Double-Checked Locking is Broken' Declaration". For many, it was the wake-up call that there be dragons.
Implicit full memory barriers are usually included in platform thread synchronization routines, which cover the core of it. However, for lock-free programming and implementing custom, lightweight synchronization patterns, you often need just the barrier, or even a one-way barrier only.
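A minimal sketch of the "just the barrier" pattern in C++11 (an illustration, not production code; the release fence pairs with the acquire fence to order the plain write, each being a one-way barrier):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;
std::atomic<bool> flag{false};

void writer() {
    payload = 123;  // plain, non-atomic write
    // One-way barrier: earlier writes may not be reordered past this point.
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(true, std::memory_order_relaxed);
}

void reader() {
    while (!flag.load(std::memory_order_relaxed)) { /* spin */ }
    // One-way barrier: later reads may not be reordered before this point.
    std::atomic_thread_fence(std::memory_order_acquire);
    assert(payload == 123);  // guaranteed by the fence pairing
}
```

Launched as two threads, the reader is guaranteed to observe `payload == 123` once it sees the flag, without any lock.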

Wikipedia knows all...
A memory barrier, also known as a membar or memory fence, is a class of instructions which cause a central processing unit (CPU) to enforce an ordering constraint on memory operations issued before and after the barrier instruction.
CPUs employ performance optimizations that can result in out-of-order execution, including memory load and store operations. Memory operation reordering normally goes unnoticed within a single thread of execution, but causes unpredictable behaviour in concurrent programs and device drivers unless carefully controlled. The exact nature of an ordering constraint is hardware dependent, and defined by the architecture's memory model. Some architectures provide multiple barriers for enforcing different ordering constraints.
Memory barriers are typically used when implementing low-level machine code that operates on memory shared by multiple devices. Such code includes synchronization primitives and lock-free data structures on multiprocessor systems, and device drivers that communicate with computer hardware.

A memory fence (memory barrier) is a kind of lock-free mechanism for synchronising multiple threads. In a single-threaded environment reordering is safe.
The problem is ordering, shared resources, and caching. The processor or compiler is able to reorder program instructions (program order) for optimisation, which creates side effects in a multithreaded environment. That is why memory barriers were introduced: to guarantee that the program will work properly. It is slower, but it fixes this type of issue.
[Java Happens-before]
[iOS Memory Barriers]

Related

Is memory fence and memory barrier same?

Here I am confused by the term memory fence (the fence function in Rust). I can clearly understand what a memory barrier is in terms of atomics, but I was unable to figure out what a memory fence is.
Are memory fences and memory barriers the same? If not, what is the difference, and when should a memory fence be used over a memory barrier?
A "fence" in this context is a kind of memory barrier. This distinction is important. For the purposes of this discussion I'll distinguish informally between three kinds of beasts:
Atomic fence: controls the order in which observers can see the effects of atomic memory operations. (This is what you asked about.)
More general memory barrier: controls the order of actual operations against memory or memory-mapped I/O. This is often a bigger hammer that can achieve similar results to an atomic fence, but at higher cost. (Depends on the architecture.)
Compiler fence: controls the order of instructions the processor receives. This is not what you asked about, but people often accidentally use this in place of a real barrier, which makes them sad later.
What fence is
Rust's std::sync::atomic::fence provides an atomic fence operation, which provides synchronization between other atomic fences and atomic memory operations. The terms folks use for describing the various atomic conditions can be a little daunting at first, but they are pretty well defined in the docs, though at the time of this writing there are some omissions. Here are the docs I suggest reading if you want to learn more.
First, Rust's docs for the Ordering type. This is a pretty good description of how operations with different Ordering interact, with less jargon than a lot of references in this area (atomic memory orderings). However, at the time of this writing, it's misleading for your specific question, because it says things like
This ordering is only applicable for operations that can perform a store.
which ignores the existence of fence.
The docs for fence go a little ways to repair that. IMO the docs in this area could use some love.
However, if you want all the interactions precisely laid out, I'm afraid you must look to a different source: the equivalent C++ docs. I know, we're not writing C++, but Rust inherits a lot of this behavior from LLVM, and LLVM tries to follow the C++ standard here. The C++ docs are much higher in jargon, but if you read slowly it's not actually more complex than the Rust docs -- just jargony. The nice thing about the C++ docs is that they discuss each interaction case between load/store/fence and load/store/fence.
What fence is not
The most common place that I employ memory barriers is to reason about completion of writes to memory-mapped I/O in low level code, such as drivers. (This is because I tend to work low in the stack, so this may not apply to your case.) In this case, you are likely performing volatile memory accesses, and you want barriers that are stronger than what fence offers.
In particular, fence helps you reason about which atomic memory operations are visible to which other atomic memory operations; it does not help you reason about whether a particular stored value has made it all the way through the memory hierarchy and onto a particular level of the bus. For cases like that, you need a different sort of memory barrier.
These are the sorts of barriers described in considerable detail in the Linux Kernel's documentation on memory barriers.
In response to another answer on this question that flat stated that fence and barrier are equivalent, I raised this case on the Rust Unsafe Code Guidelines issue tracker and got some clarifications.
In particular, you might notice that the docs for Ordering and fence make no mention of how they interact with volatile memory accesses, and that's because they do not. Or at least, they aren't guaranteed to -- on certain architectures the instructions that need to be generated are the same (ARM), and in other cases, they are not (PowerPC).
Rust currently provides a portable atomic fence (which you found), but does not provide portable versions of any other sort of memory barrier, like those provided in the Linux kernel. If you need to reason about the completion of (for example) volatile memory accesses, you will need either non-portable asm! or a function/macro that winds up producing it.
Aside: compiler fences
When I make statements like what I said above, someone inevitably hops in with (GCC syntax)
asm volatile("" ::: "memory");
This is neither an atomic fence nor a memory barrier: it is roughly equivalent to Rust's compiler_fence, in that it discourages the compiler from reordering memory accesses across that point in the generated code. It has no effect on the order that the instructions are started or finished by the machine.
There is no difference.
"Fence" and "barrier" mean the same thing in this context.

Will other threads see a write to a `volatile` word sized variable in reasonable time?

When asking about a more specific problem I discovered this is the core issue where people are not exactly sure.
The following assumptions can be made:
CPU does use a cache coherency protocol like MESI(F) (examples: x86/x86_64 and ARMv7mp)
variable is assumed to be of a size which is atomically written/read by the processor (aligned and native word size)
The variable is declared volatile
The questions are:
If I write to the variable in one thread, will other threads see the change?
What is the order of magnitude of the timeframe in which the other threads will see the change?
Do you know of architectures where cache coherency is not enough to ensure cross-CPU / cross-core visibility?
The question is NOT:
Is it safe to use such a variable?
about reordering issues
about C++11 atomics
This might be considered a duplicate of In C/C++, are volatile variables guaranteed to have eventually consistent semantics betwen threads? and other similar questions, but I think none of these have those clear requirements regarding the target architecture which leads to a lot of confusion about differing assumptions.
Do you know of architectures where cache coherency is not enough to ensure cross-CPU / cross-core visibility?
I'm not aware of any single processor with multiple cores that has cache coherency issues. It might be possible for someone to use the wrong type of processor in a multi-processor board, for example an Intel processor that has what Intel calls external QPI disabled, but this would cause all sorts of issues.
Wiki article about Intel's QPI and which processors have it enabled or disabled:
http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect
If I write to the variable in one thread, will other threads see the change?
There is no guarantee. If you think there is, show me where you found it.
What is the order of magnitude of the timeframe in which the other threads will see the change?
It can be never. There is no guarantee.
Do you know of architectures where cache coherency is not enough to ensure cross-CPU / cross-core visibility?
This is an incoherent question because you are talking about operations in C++ code that has to be compiled into assembly code. Even if you have hardware guarantees that apply to assembly code, there's no guarantee those guarantees "pass through" to C++ code.
But to the extent the question can be answered, the answer is yes. Posted writes, read prefetching, and other kinds of caching (such as what compilers do with registers) exist in real platforms.
I'd say no, there is no guarantee. There are implementations using multiple, independent computers where shared data has to be transmitted over a (usually very fast) connection between computers. In that situation, you'd try to transmit data only when it is needed. This might be triggered by mutexes, for example, and by the standard atomic functions, but hopefully not by stores into arbitrary local memory, and maybe not by stores into volatile memory.
I may be wrong, but you'd have to prove me wrong.
Assuming nowadays x86/64:
If I write to the variable in one thread, will other threads see the change?
Yes, assuming you use a modern (not very old or buggy) compiler.
What is the order of magnitude of the timeframe in which the other threads will see the change?
It really depends how you measure.
Basically, this would be the memory latency time: about 200 cycles on the same NUMA node, and about double that on the other node of a 2-node box. It might differ on bigger boxes.
If your write gets reordered relatively to the point of time measurement, you can get +/-50 cycles.
I measured this a few years back and got 60-70ns on 3GHz boxes and double that on the other node.
Do you know of architectures where cache coherency is not enough to ensure cross-CPU / cross-core visibility?
I think the meaning of cache coherency is visibility. Having said that, I'm not sure Sun's RISC (SPARC) machines have the same cache coherency and relaxed memory model as x86, so I'd test very carefully on them. Specifically, you might need to add memory release barriers to force flushing of memory writes.
Given the assumptions you have described, there is no guarantee that a write of a volatile variable in one thread will be "seen" in another.
Given that, your second question (about the timeframe) is not applicable.
With (multi-processor) PowerPC architectures, cache coherency is not sufficient to ensure cross-core visibility of a volatile variable. There are explicit instructions that need to be executed to ensure state is flushed (and to make it visible across multiple processors and their caches).
In practice, on architectures that require such instructions to be executed, the implementation of data synchronisation primitives (mutexes, semaphores, critical sections, etc) does - among other things - use those instructions.
More broadly, the volatile keyword in C++ has nothing to do with multithreading at all, let alone anything to do with cross-cache coherency. volatile, within a given thread of execution, translates to a need for things like fetches and writes of the variable not being eliminated or reordered by the compiler (which affects optimisation). It does not translate into any requirement about ordering or synchronisation of the completion of fetches or writes between threads of execution - and such requirements are necessary for cache coherency.
Notionally, a compiler might be implemented to provide such guarantees. I've yet to see any information about one that does so - which is not surprising, as providing such a guarantee would seriously affect performance of multithreaded code by forcing synchronisation between threads - even if the programmer has not used synchronisation (mutexes, etc) in their code.
Similarly, the host platform could also notionally provide such guarantees with volatile variables - even if the instructions being executed don't specifically require them. Again, that would tend to reduce performance of multithreaded programs - including modern operating systems - on those platforms. It would also affect (or negate) the benefits of various features that contribute to performance of modern processors, such as pipelining, by forcing processors to wait on each other.
If, as a C++ developer (as distinct from someone writing code that exploits specific features offered by your particular compiler or host platform) you want a variable written in one thread able to be coherently read by another thread, then don't bother with volatile. Perform synchronisation between threads - when they need to access the same variable concurrently - using provided techniques - such as mutexes. And follow the usual guidelines on using those techniques (e.g. use mutexes sparingly and minimise the time which they are held, do as much as possible in your threads without accessing variables that are shared between threads at all).
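To make that last point concrete, here is a sketch of the portable alternative to a volatile flag (the names are illustrative; `std::atomic` gives the visibility and ordering guarantees that `volatile` does not):

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> stop_requested{false};  // NOT `volatile bool`

void worker() {
    // The acquire load guarantees this thread eventually observes the store
    // below, and also orders any data published before it.
    while (!stop_requested.load(std::memory_order_acquire)) {
        // ... do a unit of work ...
    }
}

void shut_down(std::thread& worker_thread) {
    stop_requested.store(true, std::memory_order_release);
    worker_thread.join();  // guaranteed to return: the store is visible
}
```

With a plain `volatile bool` in place of the atomic, the program has a data race and no standard guarantee that the worker ever observes the change.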

Does pthread_mutex lock provide higher performance than user imposed memory barrier in code

Problem Background
The code in question is related to a C++ implementation. We have a code base where, for certain critical implementations, we use asm volatile ("mfence" ::: "memory").
My understanding of memory barriers is -
It is used to ensure complete/ordered execution of the instruction set.
It helps avoid the classical thread synchronization problems - Wiki link.
Question
Is a pthread_mutex faster than the memory barrier, in the case where we use a memory fence to avoid the thread synchronization problem? I have read content which indicates that a pthread mutex uses memory synchronization.
PS :
In our code, asm volatile ("mfence" ::: "memory") is used after 10-15 lines of C++ code (of a member function). So my doubt is: maybe a mutex implementation of the memory synchronization gives better performance than an MB in user-implemented code (w.r.t. the scope of the MB).
We are using SUSE Linux 10, 2.6.16.46, smp#1, x64_86 with quad core processor.
pthread mutexes are guaranteed to be slower than a memory fence instruction (I can't say how much slower; that is entirely platform dependent). The reason is simply that in order to be compliant POSIX mutexes, they must include memory guarantees. POSIX mutexes have strong memory guarantees, and thus I can't see how they would be implemented without such fences*.
If you're looking for practical advice I use fences in many places instead of mutexes and have timed both of them frequently. pthread_mutexes are very slow on Linux compared to just a raw memory fence (of course, they do a lot more, so be careful what you are actually comparing).
Note however that certain atomic operations, in particular those in C++11, could, and certainly will, be faster than using fences all over. In this case the compiler/library understands the architecture and need not use a full fence in order to provide the memory guarantees.
Also note, I'm talking about very low-level performance of the lock itself. You need to be profiling to the nanosecond level.
*It is possible to imagine a mutex system which ignores certain types of memory and chooses a more lenient locking implementation (such as relying on the ordering guarantees of normal memory and ignoring specially marked memory). I would argue, however, that such an implementation is not valid.
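If you want to time the two yourself, a rough sketch follows (illustrative only; numbers vary wildly by platform, and an uncontended lock/unlock pair does far more work than the fence alone, so be careful what you compare):

```cpp
#include <atomic>
#include <chrono>
#include <mutex>

std::mutex m;

// Returns nanoseconds per iteration of `f` repeated `n` times.
template <typename F>
double ns_per_op(F f, int n) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
}

double time_mutex(int n) {
    // Uncontended lock/unlock: includes the mutex's own fencing.
    return ns_per_op([] { m.lock(); m.unlock(); }, n);
}

double time_fence(int n) {
    // A bare full fence (mfence on x86).
    return ns_per_op([] { std::atomic_thread_fence(std::memory_order_seq_cst); }, n);
}
```

Profile at the nanosecond level, as the answer above says; a single run of either loop is dominated by noise.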

How fast is access to atomic variables in C++

My question is: how fast is access to atomic variables in C++ using the C++0x atomic<> class? What goes on at the cache level? Say one thread is just reading it; would it need to go down to RAM, or can it just read from the cache of the core on which it is executing? Assume the architecture is x86.
I am especially interested in knowing whether, if a thread is just reading from it while no other thread is writing at that time, the penalty would be the same as for reading a normal variable. How are atomic variables accessed? Does each read implicitly involve a write as well, as in compare-and-swap? Are atomic variables implemented using compare-and-swap?
If you want raw numbers, Agner Fog's data listings from his optimization manuals should be of use; also, Intel's manuals have a few sections detailing the latencies for memory reads/writes on multicore systems, which should include details on the slow-downs caused by the bus locking needed for atomic writes.
The answer is not as simple as you perhaps expect. It depends on exact CPU model, and it depends on circumstances as well. The worst case is when you need to perform read-modify-write operation on a variable and there is a conflict (what exactly is a conflict is again CPU model dependent, but most often it is when another CPU is accessing the same cache line).
See also .NET or Windows Synchronization Primitives Performance Specifications
Atomics use special architecture support to get atomicity without forcing all reads/writes to go all the way to main memory. Basically, each core is allowed to probe the caches of other cores, so they find out about the result of other thread's operations that way.
The exact performance depends on the architecture. On x86, many operations were already atomic to start with, so they are free. I've seen numbers anywhere from 10 to 100 cycles, depending on the architecture and operation. For perspective, any read from main memory is 3000-4000 cycles, so atomics are all much faster than going straight to memory on nearly all platforms.
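For the read-only case asked about, a plain load of an aligned atomic word involves no compare-and-swap at all (a sketch; on x86 even a seq_cst load compiles to an ordinary MOV):

```cpp
#include <atomic>

std::atomic<int> counter{7};

int read_counter() {
    // A pure load: no LOCK prefix, no bus locking, no implicit write.
    // If the cache line is already in this core's cache (Shared or
    // Exclusive state), the read is served from cache, just like a
    // normal variable.
    return counter.load(std::memory_order_seq_cst);
}
```

Only read-modify-write operations (fetch_add, exchange, compare_exchange) need the LOCK prefix on x86; loads and stores of naturally aligned words are atomic on their own.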

Compare and swap C++0x

From the C++0x proposal on C++ Atomic Types and Operations:
29.1 Order and Consistency [atomics.order]
Add a new sub-clause with the following paragraphs.
The enumeration memory_order specifies the detailed regular (non-atomic) memory synchronization order as defined in [the new section added by N2334 or its adopted successor] and may provide for operation ordering. Its enumerated values and their meanings are as follows.
memory_order_relaxed
The operation does not order memory.
memory_order_release
Performs a release operation on the affected memory locations, thus making regular memory writes visible to other threads through the atomic variable to which it is applied.
memory_order_acquire
Performs an acquire operation on the affected memory locations, thus making regular memory writes in other threads released through the atomic variable to which it is applied, visible to the current thread.
memory_order_acq_rel
The operation has both acquire and release semantics.
memory_order_seq_cst
The operation has both acquire and release semantics, and in addition, has sequentially-consistent operation ordering.
Lower in the proposal:
bool A::compare_swap( C& expected, C desired,
memory_order success, memory_order failure ) volatile
where one can specify memory order for the CAS.
My understanding is that “memory_order_acq_rel” will only necessarily synchronize those memory locations which are needed for the operation, while other memory locations may remain unsynchronized (it will not behave as a memory fence).
Now, my question is - if I choose “memory_order_acq_rel” and apply compare_swap to integral types, for instance, integers, how is this typically translated into machine code on modern consumer processors such as a multicore Intel i7? What about the other commonly used architectures (x64, SPARC, ppc, arm)?
In particular (assuming a concrete compiler, say gcc):
How to compare-and-swap an integer location with the above operation?
What instruction sequence will such a code produce?
Is the operation lock-free on i7?
Will such an operation run a full cache coherence protocol, synchronizing caches of different processor cores as if it were a memory fence on i7? Or will it just synchronize the memory locations needed by this operation?
Related to previous question - is there any performance advantage to using acq_rel semantics on i7? What about the other architectures?
Thanks for all the answers.
The answer here is not trivial. Exactly what happens and what is meant is dependent on many things. For basic understanding of cache coherence/memory perhaps my recent blog entries might be helpful:
CPU Reordering – What is actually being reordered?
CPU Memory – Why do I need a mutex?
But that aside, let me try to answer a few questions. First off, the function below is being very hopeful about what is supported: very fine-grained control over exactly how strong a memory-ordering guarantee you get. That's reasonable for compile-time reordering, but often not for runtime barriers.
compare_swap( C& expected, C desired,
memory_order success, memory_order failure )
Architectures won't all be able to implement this exactly as you requested; many will have to strengthen it to something strong enough that they can implement. When you specify memory_order you are specifying how reordering may work. To use Intel's terms you will be specifying what type of fence you want, there are three of them, the full fence, load fence, and store fence. (But on x86, load fence and store fence are only useful with weakly-ordered instructions like NT stores; atomics don't use them. Regular load/store give you everything except that stores can appear after later loads.) Just because you want a particular fence on that operation won't mean it is supported, in which I'd hope it always falls back to a full fence. (See Preshing's article on memory barriers)
An x86 (including x64) compiler will likely use the LOCK CMPXCHG instruction to implement the CAS, regardless of memory ordering. This implies a full barrier; x86 doesn't have a way to make a read-modify-write operation atomic without a lock prefix, which is also a full barrier. Pure-store and pure-load can be atomic "on their own", with many ISAs needing barriers for anything above mo_relaxed, but x86 does acq_rel "for free" in asm.
This instruction is lock-free, although all cores trying to CAS the same location will contend for access to it so you could argue it's not really wait-free. (Algorithms that use it might not be lock-free, but the operation itself is wait-free, see wikipedia's non-blocking algorithm article). On non-x86 with LL/SC instead of locked instructions, C++11 compare_exchange_weak is normally wait-free but compare_exchange_strong requires a retry loop in case of spurious failure.
Now that C++11 has existed for years, you can look at the asm output for various architectures on the Godbolt compiler explorer.
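In today's std::atomic terms, the CAS under discussion looks like this (a sketch; compare_exchange_weak may fail spuriously on LL/SC machines, which is exactly why it always goes in a retry loop):

```cpp
#include <atomic>

// Atomically double `v` using a CAS retry loop; returns the new value.
int atomic_double(std::atomic<int>& v) {
    int expected = v.load(std::memory_order_relaxed);
    while (!v.compare_exchange_weak(expected, expected * 2,
                                    std::memory_order_acq_rel,   // on success
                                    std::memory_order_relaxed)) { // on failure
        // On failure, `expected` has been reloaded with the current value,
        // so the loop simply retries with fresh data.
    }
    return expected * 2;  // the value we installed
}
```

On x86 the whole loop body compiles down to LOCK CMPXCHG regardless of the memory_order arguments, as discussed above; on ARM/POWER the orderings actually change the emitted barriers.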
In terms of memory sync you need to understand how cache coherence works (my blog may help a bit). New CPUs use a ccNUMA architecture (previously SMP). Essentially the "view" of memory never gets out of sync. The fences used in the code don't actually force any flushing of the cache to happen per se, only a drain of the store buffer, committing in-flight stores to cache before later loads.
If two cores both have the same memory location cached in a cache line, a store by one core will get exclusive ownership of the cache line (invalidating all other copies) and mark its own copy as dirty. A very simple explanation of a very complex process.
To answer your last question you should always use the memory semantics that you logically need to be correct. Most architectures won't support all the combinations you use in your program. However, in many cases you'll get great optimizations, especially in cases where the order you requested is guaranteed without a fence (which is quite common).
-- Answers to some comments:
You have to distinguish between what it means to execute a write instruction and to write to a memory location. This is what I attempt to explain in my blog post. By the time the "0" is committed to 0x100, all cores see that zero. Writing integers is also atomic; even without a lock, when you write to a location all cores will immediately have that value if they wish to use it.
The trouble is that to use the value you have likely loaded it into a register first, any changes to the location after that obviously won't touch the register. This is why one needs mutexes or atomic<T> despite a cache coherent memory: the compiler is allowed to keep plain variable values in private registers. (In C++11, that's because a data-race on non-atomic variables is Undefined Behaviour.)
As to contradictory claims, generally you'll see all sorts of claims. Whether they are contradictory comes right down to exactly what "see" "load" "execute" mean in the context. If you write "1" to 0x100, does that mean you executed the write instruction or did the CPU actually commit that value. The difference created by the store buffer is one major cause of reordering (the only one x86 allows). The CPU can delay writing the "1", but you can be sure that the moment it does finally commit that "1" all cores see it. The fences control this ordering by making the thread wait until a store commits before doing later operations.
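The store-buffer delay described above is exactly what a full fence controls. A sketch of the classic two-thread litmus test (with the seq_cst fences in place, the outcome `r1 == 0 && r2 == 0` is forbidden; with the fences removed, x86 can and occasionally does produce it):

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread_a() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // drain store buffer
    r1 = y.load(std::memory_order_relaxed);
}

void thread_b() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // drain store buffer
    r2 = x.load(std::memory_order_relaxed);
}

// Run one trial; with the fences, at least one thread must observe
// the other's store, so this always returns true.
bool run_litmus() {
    x = 0; y = 0; r1 = 0; r2 = 0;
    std::thread a(thread_a), b(thread_b);
    a.join(); b.join();
    return !(r1 == 0 && r2 == 0);
}
```

This is the one reordering x86 allows (a store delayed in the store buffer past a later load), and the fence removes it by making the thread wait for the store to commit.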
Your whole worldview seems off base: your question insinuates that cache consistency is controlled by memory orders at the C++ level and fences or atomic operations at the CPU level.
But cache consistency is one of the most important invariants of the physical architecture, and it's provided at all times by the memory system that consists of the interconnection of all CPUs and the RAM. You can never beat it from code running on a CPU, or even see its detail of operation. Of course, by observing RAM directly and running code elsewhere you might see stale data at some level of memory: by definition the RAM doesn't have the newest value of all memory locations.
But code running on a CPU can't access DRAM directly, only through the memory hierarchy which includes caches that communicate with each other to maintain coherency of this shared view of memory. (Typically with MESI). Even on a single core, a write-back cache lets DRAM values be stale, which can be an issue for non-cache-coherent DMA but not for reading/writing memory from a CPU.
So the issue exists only for external devices, and only ones that do non-coherent DMA. (DMA is cache-coherent on modern x86 CPUs; the memory controller being built-in to the CPU makes this possible).
Will such an operation run a full cache coherence protocol,
synchronizing caches of different processor cores as if it were a
memory fence on i7?
They are already synchronized. See Does a memory barrier ensure that the cache coherence has been completed? - memory barriers only do local things inside the core running the barrier, like flush the store buffer.
Or will it just synchronize the memory locations
needed by this operation?
An atomic operation applies to exactly one memory location. What other locations do you have in mind?
On a weakly-ordered CPU, a memory_order_relaxed atomic increment could avoid making earlier loads/stores visible before that increment. But x86's strongly-ordered memory model doesn't allow that.