I am confused by the term memory fence (the fence function in Rust). I clearly understand what a memory barrier is in terms of atomics, but I was unable to figure out what a memory fence is.
Are memory fences and memory barriers the same? If not, what is the difference, and when should I use a memory fence over a memory barrier?

A "fence" in this context is a kind of memory barrier. This distinction is important. For the purposes of this discussion I'll distinguish informally between three kinds of beasts:
Atomic fence: controls the order in which observers can see the effects of atomic memory operations. (This is what you asked about.)
More general memory barrier: controls the order of actual operations against memory or memory-mapped I/O. This is often a bigger hammer that can achieve similar results to an atomic fence, but at higher cost. (Depends on the architecture.)
Compiler fence: controls the order of instructions the processor receives. This is not what you asked about, but people often accidentally use this in place of a real barrier, which makes them sad later.
What fence is
Rust's std::sync::atomic::fence provides an atomic fence operation, which provides synchronization between other atomic fences and atomic memory operations. The terms folks use for describing the various atomic conditions can be a little daunting at first, but they are pretty well defined in the docs, though at the time of this writing there are some omissions. Here are the docs I suggest reading if you want to learn more.
First, Rust's docs for the Ordering type. This is a pretty good description of how operations with different Ordering interact, with less jargon than a lot of references in this area (atomic memory orderings). However, at the time of this writing, it's misleading for your specific question, because it says things like
This ordering is only applicable for operations that can perform a store.
which ignores the existence of fence.
The docs for fence go a little ways to repair that. IMO the docs in this area could use some love.
However, if you want all the interactions precisely laid out, I'm afraid you must look to a different source: the equivalent C++ docs. I know, we're not writing C++, but Rust inherits a lot of this behavior from LLVM, and LLVM tries to follow the C++ standard here. The C++ docs are much higher in jargon, but if you read slowly it's not actually more complex than the Rust docs -- just jargony. The nice thing about the C++ docs is that they discuss each interaction case between load/store/fence and load/store/fence.
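Since the precise rules live in the C++ docs, here is a minimal C++ sketch of the fence-to-fence case, using std::atomic_thread_fence (the C++ counterpart of Rust's fence). The names payload, ready, and fence_handoff are made up for illustration. The flag itself is only accessed with relaxed operations; the two fences are what provide the synchronization.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical example: hand a plain int from one thread to another
// using relaxed atomics plus a release/acquire fence pair.
int fence_handoff() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // only ever accessed with relaxed ordering
    int seen = -1;

    std::thread producer([&] {
        payload = 42;
        // Release fence: the write to payload is ordered before the flag store.
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    });

    std::thread consumer([&] {
        while (!ready.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Acquire fence: pairs with the release fence in the producer.
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = payload;  // reads 42 without a data race
    });

    producer.join();
    consumer.join();
    assert(seen == 42);
    return seen;
}
```

Calling fence_handoff() from main should return 42. Note that this says nothing about volatile or memory-mapped I/O accesses; it only orders what other threads can observe through atomics.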
What fence is not
The most common place that I employ memory barriers is to reason about completion of writes to memory-mapped I/O in low level code, such as drivers. (This is because I tend to work low in the stack, so this may not apply to your case.) In this case, you are likely performing volatile memory accesses, and you want barriers that are stronger than what fence offers.
In particular, fence helps you reason about which atomic memory operations are visible to which other atomic memory operations -- it does not help you reason about whether a particular stored value has made it all the way through the memory hierarchy and onto a particular level of the bus, for instance. For cases like that, you need a different sort of memory barrier.
These are the sorts of barriers described in considerable detail in the Linux Kernel's documentation on memory barriers.
In response to another answer on this question that flatly stated that fence and barrier are equivalent, I raised this case on the Rust Unsafe Code Guidelines issue tracker and got some clarifications.
In particular, you might notice that the docs for Ordering and fence make no mention of how they interact with volatile memory accesses, and that's because they do not. Or at least, they aren't guaranteed to -- on certain architectures the instructions that need to be generated are the same (ARM), and in other cases, they are not (PowerPC).
Rust currently provides a portable atomic fence (which you found), but does not provide portable versions of any other sort of memory barrier, like those provided in the Linux kernel. If you need to reason about the completion of (for example) volatile memory accesses, you will need either non-portable asm! or a function/macro that winds up producing it.
Aside: compiler fences
When I make statements like what I said above, someone inevitably hops in with (GCC syntax)
asm("" ::: "memory");
This is neither an atomic fence nor a memory barrier: it is roughly equivalent to Rust's compiler_fence, in that it discourages the compiler from reordering memory accesses across that point in the generated code. It has no effect on the order that the instructions are started or finished by the machine.
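For comparison, the closest standard C++ counterpart I know of is std::atomic_signal_fence. Below is a minimal sketch; g_value, g_flag, and publish_with_compiler_fence are hypothetical names. The fence constrains the compiler only, which is enough for a signal handler running on the same thread but not for another core.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical globals, imagined as shared with a signal handler
// that runs on the same thread.
int g_value = 0;
std::atomic<int> g_flag{0};

int publish_with_compiler_fence() {
    g_value = 7;
    // Compiler-only fence: the compiler may not sink the store to g_value
    // below the store to g_flag. No CPU memory-ordering instruction is issued.
    std::atomic_signal_fence(std::memory_order_release);
    g_flag.store(1, std::memory_order_relaxed);
    return g_value + g_flag.load(std::memory_order_relaxed);
}
```

If the reader is another thread, this is the wrong tool: the generated code contains no hardware barrier, so a weakly ordered machine is still free to make the two stores visible in either order.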

There is no difference.
"Fence" and "barrier" mean the same thing in this context.


Why do memory barriers depend upon a variable?

After doing some research regarding the overall purpose of memory barriers/fences, (at least I think that) I have a basic understanding of what they are for. During my research, I focused on the abstraction C++ makes as barriers seem to be hardware-specific and C++ is a low-level, yet universally usable language. However, there's one detail in the semantics of the standard library's abstraction of memory barriers that makes me question my understanding of them.
Regarding the C++ memory barrier memory_order_acq_rel, the documentation states (similar behaviour applies to the other barriers as well):
All writes in other threads that release the same atomic variable are visible before the modification and the modification is visible in other threads that acquire the same atomic variable.
On the processor level (as this restriction wouldn't exist without corresponding hardware restrictions): Why does the specification of a particular variable matter, if all previous changes are affected? For instance, the cache has to be flushed either way, hasn't it? What are the key advantages of this approach?
Thank you in advance.
Using atomic variables as a means to control memory barriers is just one way C++ gives you this control. (Probably the most commonly used way, I might add.)
You don't need to use them though.
You can call functions such as std::atomic_thread_fence and std::atomic_signal_fence which implement the memory barrier semantics without any associated variable.
Generally, you should adhere to the C++20 memory model. NVIDIA developers found a bug in the former model (they composed fully legal C++ code that follows the standard's rules but results in UB, a data race, due to issues in the memory model), and I have heard there were some other issues as well. Furthermore, C++ strives to be a general language that works across a wide spectrum of devices, so some rules might be meaningless for certain devices and extremely important for others.
I am unsure about implementation details and what the processor actually needs to do. However, besides the processor's actions on the atomic variable, the ordering also informs the compiler about allowed and forbidden optimizations. E.g., local variables that are logically inaccessible from other threads never need to be reloaded into cache regardless of actions performed on atomic variables.
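To make the "same atomic variable" wording from the question concrete, here is a minimal sketch of memory_order_acq_rel on a shared counter. The names kWorkers, slots, arrived, and sum_by_last_arriver are invented for the example. Each worker writes a plain slot and then increments the counter; whoever arrives last is guaranteed to see every other worker's slot, because all of them released through that one variable.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

constexpr int kWorkers = 4;  // hypothetical worker count

int sum_by_last_arriver() {
    std::array<int, kWorkers> slots{};  // plain data, one slot per worker
    std::atomic<int> arrived{0};        // the shared atomic variable
    int total = -1;                     // written only by the last arriver

    std::vector<std::thread> workers;
    for (int id = 0; id < kWorkers; ++id) {
        workers.emplace_back([&, id] {
            slots[id] = id + 1;  // plain write
            // Release half: publishes our slot. Acquire half: makes the slots
            // of every worker that incremented before us visible here.
            int before = arrived.fetch_add(1, std::memory_order_acq_rel);
            if (before == kWorkers - 1) {
                int sum = 0;
                for (int v : slots) sum += v;
                total = sum;
            }
        });
    }
    for (auto& t : workers) t.join();
    assert(total == 10);
    return total;  // 1 + 2 + 3 + 4
}
```

The variable matters because it is how the two sides find each other: a thread that never reads arrived has no synchronization with the writers, no matter what the hardware happened to do with its caches.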

How does a mutex ensure a variable's value is consistent across cores?

If I have a single int which I want to write to from one thread and read from on another, I need to use std::atomic, to ensure that its value is consistent across cores, regardless of whether or not the instructions that read from and write to it are conceptually atomic. If I don't, it may be that the reading core has an old value in its cache, and will not see the new value. This makes sense to me.
If I have some complex data type that cannot be read/written to atomically, I need to guard access to it using some synchronisation primitive, such as std::mutex. This will prevent the object getting into (or being read from) an inconsistent state. This makes sense to me.
What doesn't make sense to me is how mutexes help with the caching problem that atomics solve. They seem to exist solely to prevent concurrent access to some resource, but not to propagate any values contained within that resource to other cores' caches. Is there some part of their semantics I've missed which deals with this?
The right answer to this is magic pixies - i.e. It Just Works. The implementation of std::atomic for each platform must do the right thing.
The right thing is a combination of 3 parts.
Firstly, the compiler needs to know that it can't move instructions across boundaries [in fact it can in some cases, but assume that it doesn't].
Secondly, the cache/memory subsystem needs to know - this is generally done using memory barriers, although x86/x64 generally have such strong memory guarantees that this isn't necessary in the vast majority of cases (which is a big shame, as it's nice for wrong code to actually go wrong).
Finally the CPU needs to know it cannot reorder instructions. Modern CPUs are massively aggressive at reordering operations and making sure in the single threaded case that this is unnoticeable. They may need more hints that this cannot happen in certain places.
For most CPUs part 2 and 3 come down to the same thing - a memory barrier implies both. Part 1 is totally inside the compiler, and is down to the compiler writers to get right.
See Herb Sutter's talk 'atomic<> Weapons' for a lot more interesting info.
The consistency across cores is ensured by memory barriers (which also prevent instruction reordering). When you use std::atomic, not only do you access the data atomically, but the compiler (and library) also insert the relevant memory barriers.
Mutexes work the same way: the mutex implementations (eg. pthreads or WinAPI or what not) internally also insert memory barriers.
Most modern multicore processors (including x86 and x64) are cache coherent. If two cores hold the same memory location in cache and one of them updates the value, the change is automatically propagated to other cores' caches. It's inefficient (writing to the same cache line at the same time from two cores is really slow) but without cache coherence it would be very difficult to write multithreaded software.
And like syam said, memory barriers are also required. They prevent the compiler or processor from reordering memory accesses, and also force the write into memory (or at least into cache) when, for example, a variable is held in a register because of compiler optimizations.
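A minimal sketch of what that means in practice: the data below is a plain struct with no atomics at all, yet both threads always see each other's updates, because lock() behaves as an acquire operation and unlock() as a release operation. Account and run_deposits are hypothetical names.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical multi-field type that cannot be updated atomically.
struct Account {
    long deposits = 0;
    long balance = 0;
};

long run_deposits(int per_thread) {
    Account account;
    std::mutex m;

    auto worker = [&] {
        for (int i = 0; i < per_thread; ++i) {
            std::lock_guard<std::mutex> lock(m);  // lock(): acquire
            ++account.deposits;
            account.balance += 10;
        }                                         // unlock(): release
    };

    std::thread a(worker);
    std::thread b(worker);
    a.join();
    b.join();

    // Both fields stayed consistent with each other.
    assert(account.balance == account.deposits * 10);
    return account.balance;
}
```

With two threads doing 10000 deposits each, run_deposits(10000) should return 200000. Remove the lock_guard and the program has a data race, which is undefined behaviour in C++.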

Does pthread_mutex lock provide higher performance than user imposed memory barrier in code

Problem Background
The code in question is related to a C++ implementation. We have a code base where, for certain critical parts of the implementation, we use asm volatile ("mfence" ::: "memory").
My understanding of memory barriers is -
It is used to ensure complete/ordered execution of the instruction set.
It helps avoid the classical thread synchronization problem - Wiki link.
Is pthread_mutex faster than the memory barrier in case we use a memory fence to avoid the thread synchronization problem? I have read material which indicates that pthread mutex uses memory synchronization.
PS :
In our code, asm volatile ("mfence" ::: "memory") is used after 10-15 lines of C++ code (of a member function). So my doubt is - maybe a mutex implementation of the memory synchronization gives better performance than an MB in user-implemented code (w.r.t. the scope of the MB).
We are using SUSE Linux 10, SMP, x86_64 with a quad-core processor.
pthread mutexes are guaranteed to be slower than a memory fence instruction (I can't say how much slower; that is entirely platform dependent). The reason is simply that, in order to be compliant, POSIX mutexes must include memory guarantees. POSIX mutexes have strong memory guarantees, and thus I can't see how they would be implemented without such fences*.
If you're looking for practical advice I use fences in many places instead of mutexes and have timed both of them frequently. pthread_mutexes are very slow on Linux compared to just a raw memory fence (of course, they do a lot more, so be careful what you are actually comparing).
Note however that certain atomic operations, in particular those in C++11, could, and certainly will, be faster than you using fences all over. In this case the compiler/library understands the architecture and need not use a full fence in order to provide the memory guarantees.
Also note, I'm talking about very low-level performance of the lock itself. You need to be profiling to the nanosecond level.
*It is possible to imagine a mutex system which ignores certain types of memory and chooses a more lenient locking implementation (such as relying on the ordering guarantees of normal memory and ignoring specially marked memory). I would argue such an implementation is, however, not valid.
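If you want to time this yourself, a rough single-threaded harness might look like the sketch below. Everything here (Timing, time_mutex, time_fence) is a hypothetical helper, it uses std::mutex and std::atomic_thread_fence rather than raw pthread calls or inline mfence, and it measures only the uncontended cost. Treat the numbers as indicative at best.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>

// Hypothetical result type: final counter value plus elapsed nanoseconds.
struct Timing {
    long long counter;
    long long nanos;
};

Timing time_mutex(int iterations) {
    std::mutex m;
    volatile long long counter = 0;  // volatile so the loop is not optimized away
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> lock(m);  // uncontended lock + unlock
        counter = counter + 1;
    }
    auto end = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    return Timing{counter, static_cast<long long>(ns.count())};
}

Timing time_fence(int iterations) {
    volatile long long counter = 0;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        counter = counter + 1;
        // Full fence; on x86-64 compilers typically emit mfence or a locked
        // instruction here.
        std::atomic_thread_fence(std::memory_order_seq_cst);
    }
    auto end = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    return Timing{counter, static_cast<long long>(ns.count())};
}
```

Compare time_mutex(n).nanos against time_fence(n).nanos for a large n on your own machine. Remember the caveat above: the mutex also provides mutual exclusion, which the bare fence does not, so the two are only interchangeable if your algorithm is correct without a lock.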

Overhead of a Memory Barrier / Fence

I'm currently writing C++ code and use a lot of memory barriers / fences in my code. I know that an MB tells the compiler and the hardware not to reorder writes/reads around it. But I don't know how complex this operation is for the processor at runtime.
My question is: What is the runtime overhead of such a barrier? I didn't find any useful answer with Google...
Is the overhead negligible? Or does heavy usage of MBs lead to serious performance problems?
Best regards.
Compared to arithmetic and "normal" instructions I understand these to be very costly, but do not have numbers to back up that statement. I like jalf's answer by describing effects of the instructions, and would like to add a bit.
There are in general a few different types of barriers, so understanding the differences could be helpful. A barrier like the one that jalf mentioned is required for example in a mutex implementation before clearing the lock word (lwsync on ppc, or st4.rel on ia64 for example). All reads and writes must be complete, and only instructions later in the pipeline that have no memory access and no dependencies on in progress memory operations can be executed.
Another type of barrier is the sort that you'd use in a mutex implementation when acquiring a lock (examples, isync on ppc, or instr.acq on ia64). This has an effect on future instructions, so if a non-dependent load has been prefetched it must be discarded. Example:
if ( pSharedMem->atomic.bit_is_set() ) // use a bit to flag that somethingElse is "ready"
    foo( pSharedMem->somethingElse ) ;
Without an acquire barrier (borrowing ia64 lingo), your program may have unexpected results if somethingElse made it into a register before the check of the flagging bit is complete.
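In portable C++ the same pattern might be sketched as follows. Shared, kReadyBit, foo, and consume_when_ready are stand-ins I made up, and I spin on the flag instead of using a single if so that the example always has something to read. The acquire load is what stops somethingElse from being read early.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

constexpr unsigned kReadyBit = 0x1;  // hypothetical "ready" flag bit

struct Shared {
    std::atomic<unsigned> flags{0};
    int somethingElse = 0;  // plain data guarded by the ready bit
};

// Hypothetical stand-in for the foo() in the snippet above.
int foo(int v) { return v * 2; }

int consume_when_ready() {
    Shared shared;

    std::thread writer([&] {
        shared.somethingElse = 21;
        // Release: the write above is visible to whoever acquires the bit.
        shared.flags.fetch_or(kReadyBit, std::memory_order_release);
    });

    // Acquire: the read of somethingElse below cannot be hoisted above
    // this load, neither by the compiler nor by the CPU.
    while (!(shared.flags.load(std::memory_order_acquire) & kReadyBit)) {
        std::this_thread::yield();
    }
    int result = foo(shared.somethingElse);

    writer.join();
    return result;
}
```

consume_when_ready() should return 42. With a relaxed load in place of the acquire load, the read of somethingElse would be a data race.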
There is a third type of barrier, generally less used, and is required to enforce store load ordering. Examples of instructions for such an ordering enforcing instruction are, sync on ppc (heavyweight sync), MF on ia64, membar #storeload on sparc (required even for TSO).
Using ia64-like pseudocode to illustrate, suppose one had a st4.rel followed by a load.
Without an mf in between, one has no guarantee that the load follows the store. You know that loads and stores preceding the st4.rel are done before that store or the "subsequent" load, but that load or other future loads (and perhaps stores if non-dependent?) could sneak in, completing earlier since nothing prevents that otherwise.
Because mutex implementations very likely only use acquire and release barriers in their implementations, I'd expect that an observable effect of this is that memory access following lock release may actually sometimes occur while "still in the critical section".
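The store-load case can be sketched in C++ with the classic "store buffering" litmus test. The function name count_both_zero is hypothetical. Each thread stores to one variable, issues a full fence, then loads the other variable. With the seq_cst fences in place the standard rules out both loads returning 0; with them removed, or weakened to acquire/release, that outcome is allowed and does show up on real hardware, x86 included.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the litmus test `rounds` times and counts how often both threads
// read 0, i.e. how often each load effectively ran before the other
// thread's store.
int count_both_zero(int rounds) {
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0};
        std::atomic<int> y{0};
        int r1 = -1;
        int r2 = -1;

        std::thread a([&] {
            x.store(1, std::memory_order_relaxed);
            // Full fence: orders the store above before the load below.
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread b([&] {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_relaxed);
        });
        a.join();
        b.join();

        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

As written, count_both_zero should always return 0. This is the expensive kind of barrier: on x86-64 the fence typically compiles to mfence or a locked instruction, on ppc to a heavyweight sync.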
Try thinking about what the instruction does. It doesn't make the CPU do anything complicated in terms of logic, but it forces it to wait until all reads and writes have been committed to main memory. So the cost really depends on the cost of accessing main memory (and the number of outstanding reads/writes).
Accessing main memory is generally pretty expensive (10-200 clock cycles), but in a sense, that work would have to be done without the barrier as well, it could just be hidden by executing some other instructions simultaneously so you didn't feel the cost so much.
It also limits the CPU's (and the compiler's) ability to reschedule instructions, so there may be an indirect cost as well, in that nearby instructions can't be interleaved in a way that might otherwise yield a more efficient execution schedule.

What is a memory fence?

What is meant by using an explicit memory fence?
For performance gains modern CPUs often execute instructions (including memory reads/writes) out of order to make maximum use of the available silicon. Because the hardware enforces instruction integrity, you never notice this in a single thread of execution. However, for multiple threads or environments with volatile memory (memory-mapped I/O, for example) this can lead to unpredictable behavior.
A memory fence/barrier is a class of instructions that ensure memory reads/writes occur in the order you expect. For example, a 'full fence' means all reads/writes before the fence are committed before those after the fence.
Note memory fences are a hardware concept. In higher-level languages we are used to dealing with mutexes and semaphores; these may well be implemented using memory fences at the low level, so explicit use of memory barriers is not necessary. Use of memory barriers requires a careful study of the hardware architecture and is more commonly found in device drivers than in application code.
The CPU reordering is different from compiler optimisations - although the artefacts can be similar. You need to take separate measures to stop the compiler reordering your instructions if that may cause undesirable behaviour (e.g. use of the volatile keyword in C).
Copying my answer to another question, What are some tricks that a processor does to optimize code?:
The most important one would be memory access reordering.
Absent memory fences or serializing instructions, the processor is free to reorder memory accesses. Some processor architectures have restrictions on how much they can reorder; Alpha is known for being the weakest (i.e., the one which can reorder the most).
A very good treatment of the subject can be found in the Linux kernel source documentation, at Documentation/memory-barriers.txt.
Most of the time, it's best to use locking primitives from your compiler or standard library; these are well tested, should have all the necessary memory barriers in place, and are probably quite optimized (optimizing locking primitives is tricky; even the experts can get them wrong sometimes).
In my experience it refers to a memory barrier, which is an instruction (explicit or implicit) to synchronize memory access between multiple threads.
The problem occurs in the combination of modern aggressive compilers (they have amazing freedom to reorder instructions, but usually know nothing of your threads) and modern multicore CPUs.
A good introduction to the problem is the "The 'Double-Checked Locking is Broken' Declaration". For many, it was the wake-up call that there be dragons.
Implicit full memory barriers are usually included in platform thread synchronization routines, which cover the core of it. However, for lock-free programming and implementing custom, lightweight synchronization patterns, you often need just the barrier, or even a one-way barrier only.
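As an illustration of the one-way barriers mentioned above, here is a sketch of double-checked locking written with C++11 atomics, where the fast path costs a single acquire load. Config, g_instance, g_init_mutex, and get_config are hypothetical names, and the object is deliberately never freed.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical lazily created object.
struct Config {
    int value = 42;
};

std::atomic<Config*> g_instance{nullptr};
std::mutex g_init_mutex;

Config* get_config() {
    // Fast path: one-way (acquire) barrier. If we see a non-null pointer,
    // we also see the fully constructed Config behind it.
    Config* p = g_instance.load(std::memory_order_acquire);
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(g_init_mutex);
        // Second check under the lock; the mutex already orders this load.
        p = g_instance.load(std::memory_order_relaxed);
        if (p == nullptr) {
            p = new Config();  // intentionally leaked for the program's lifetime
            // One-way (release) barrier: construction happens before publication.
            g_instance.store(p, std::memory_order_release);
        }
    }
    return p;
}
```

The broken versions of this pattern are the ones that use a plain pointer, so nothing orders the construction before the pointer becomes visible. In modern C++ a function-local static is usually the simpler way to get the same effect.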
Wikipedia knows all...
Memory barrier, also known as membar or memory fence, is a class of instructions which cause a central processing unit (CPU) to enforce an ordering constraint on memory operations issued before and after the barrier instruction.
CPUs employ performance optimizations that can result in out-of-order execution, including memory load and store operations. Memory operation reordering normally goes unnoticed within a single thread of execution, but causes unpredictable behaviour in concurrent programs and device drivers unless carefully controlled. The exact nature of an ordering constraint is hardware dependent, and defined by the architecture's memory model. Some architectures provide multiple barriers for enforcing different ordering constraints.
Memory barriers are typically used when implementing low-level machine code that operates on memory shared by multiple devices. Such code includes synchronization primitives and lock-free data structures on multiprocessor systems, and device drivers that communicate with computer
A memory fence (memory barrier) is a kind of lock-free mechanism for synchronising multiple threads. In a single-threaded environment, reordering is safe.
The problem lies in ordering, shared resources, and caching. The processor or compiler is able to reorder program instructions (relative to program order) for optimisation. This creates side effects in a multithreaded environment. That is why memory barriers were introduced: to guarantee that the program works properly. A barrier is slower, but it fixes this type of issue.
