Is a memory barrier related to a specific memory location? - concurrency

I'm trying to learn the basics about low-level concurrency.
From Linux documentation:
A write memory barrier gives a guarantee that all the STORE operations
specified before the barrier will appear to happen before all the STORE
operations specified after the barrier with respect to the other
components of the system.
I think that "all the STORE operations" must mean that there can be more than one instance of a particular barrier type, and that there is probably a 1:N relationship between a barrier instance and the STOREs it orders. Where can I find confirmation of this?

Memory barriers are not related to any specific memory locations.
It's not about "write to memory address x should happen before write to address y", it's about execution order of instructions, e.g. for program
x = 2
y = 1
the processor may decide: "I don't want to wait until 2 is finally stored in x; I can start writing 1 to y while x = 2 is still in progress" (also known as out-of-order execution/reordering), so a reader on another core may observe 0 in x (its initial value) after observing 1 in y, which is counter-intuitive behaviour.
If you place a write barrier between the two stores, then the reader can be sure that if it observes the result of the second store, the first one has also happened: if it reads y == 1, then it is known that x == 2. (It's not quite that easy, though, because reads can be executed out of order too, so the reading side needs a read barrier.) In other words, such a barrier forbids executing y = 1 before x = 2 has finished.
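A minimal sketch of that idea with C++11 atomics might look like this (the names writer/reader are made up for illustration; the release store plays the role of the write barrier and the acquire load plays the role of the read barrier):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0};
std::atomic<int> y{0};

void writer() {
    x.store(2, std::memory_order_relaxed);   // first store
    y.store(1, std::memory_order_release);   // release: earlier stores become visible before this one
}

void reader() {
    if (y.load(std::memory_order_acquire) == 1) {        // acquire: the matching "read barrier"
        assert(x.load(std::memory_order_relaxed) == 2);  // guaranteed to hold
    }
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}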
As @RafaelWinterhalter mentioned, there is an awesome guide for JVM compiler writers, which has many concrete examples of how barriers are mapped to real code.
As additional reading see Preshing blog, it has many articles about low level concurrency, e.g. this one about barriers.

Related

What exactly is the problem that memory barriers deal with?

I'm trying to wrap my head around the issue of memory barriers right now. I've been reading and watching videos about the subject, and I want to make sure I understand it correctly, as well as ask a question or two.
I start with understanding the problem accurately. Let's take the following classic example as the basis for the discussion: suppose we have 2 threads running on 2 different cores.
This is pseudo-code!
We start with int f = 0; int x = 0; and then run those threads:
# Thread 1
while(f == 0);
print(x)
# Thread 2
x = 42;
f = 1;
Of course, the desired result of this program is that thread 1 will print 42.
NOTE: I leave "compile-time reordering" out of this discussion, I only want to focus on what happens in runtime, so ignore all kinds of optimizations that the compiler might do.
Ok so from what I understand, the problem here is what is called "memory reordering": the CPU is free to reorder memory operations as long as the end result is what the program expects it to be. In this case, within thread 2, f = 1 may be executed before x = 42, and then thread 1 will print 0, which is not what the programmer wants.
At this point, Wikipedia points at another possible scenario that may occur:
Similarly, thread #1's load operations may be executed out-of-order and it is possible for x to be read before f is checked
Since we're talking right now about "out of order execution", let's ignore the cores' caches for a moment. So let's analyze what happens here. Start with thread 2 - the compiled instructions will look (in pseudo-assembly) something like:
1 put 42 into register1
2 write register1 to memory location of x
3 put 1 into register2
4 write register2 to memory location of f
Ok so I understand that 3-4 may be executed before 1-2. But I don't understand the equivalent in thread 1:
Let's say the instructions of thread 1 will be something like:
1 load f to register1
2 if f is 0 - jump to 1
3 load x to register2
4 print register2
What exactly may be out of order here? Can 3 happen before 1-2?
Let's go on: Up until now we talked about out-of-order execution, which brings me to my primary confusion:
In this great post the author describes the problem as such: Each core has its own cache, and the core does the memory operations against the cache, not against the main memory. The movement of memory from the core-specific caches to the main memory (or a shared cache) occurs in unpredictable time and order. So in our example - even if thread 2 will execute its instructions in-order - the writing of x=42 will occur before f=1, but that will be only to the cache of core2. The movement of these values to a shared memory may be in the opposite order, and hence the problem.
So I don't understand - when we talk about "memory reordering" - do we talk about Out-of-order execution, or are we talking about the movement of data across caches?
when we talk about "memory reordering" - do we talk about Out-of-order execution, or are we talking about the movement of data across caches?
When a thread observes changes of values in a particular order, then from the programmer's perspective it is indistinguishable whether that was due to out-of-order execution of loads, a store buffer delaying stores relative to loads and possibly letting them commit out of order (regardless of execution order), or (hypothetically in a CPU without coherent cache) cache synchronization.
Or even by forwarding store data between logical cores without going through cache, before it commits to cache and becomes visible to all cores. Some POWER CPUs can do this in real life but few if any others.
Real CPUs have coherent caches; once a value commits to cache, it's visible to all cores; it can't happen until other copies are already invalidated, so this is not the mechanism for reading "stale" data. Memory reordering on real-world CPUs is something that happens within a core, with reads and writes of coherent cache possibly happening in a different order than program order. Cache doesn't re-sync after getting out of sync; it maintains coherency in the first place.
The important effect, regardless of mechanism, is that another thread observing the same variables you're reading/writing can see the effects happen in a different order than assembly program order.
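To make that observable effect concrete, here is a hedged sketch of the classic "store buffer" litmus test in C++ (relaxed atomics are used only to keep the example free of data races; the interesting part is that r1 == 0 && r2 == 0 is an allowed outcome):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    std::thread t1([] { x.store(1, std::memory_order_relaxed);      // my store
                        r1 = y.load(std::memory_order_relaxed); }); // then read the other store
    std::thread t2([] { y.store(1, std::memory_order_relaxed);
                        r2 = x.load(std::memory_order_relaxed); });
    t1.join(); t2.join();
    std::printf("r1=%d r2=%d\n", r1, r2);   // r1 == 0 && r2 == 0 can happen (StoreLoad reordering)
}

A single run usually won't show the reordering; running it many times on a multi-core machine typically does.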
The two main questions you have both have the same answer (yes!), but for different reasons.
First let's look at this particular piece of pseudo-machine-code
Let's say the instructions of thread 1 will be something like:
1 load f to register1
2 if f is 0 - jump to 1
3 load x to register2
4 print register2
What exactly may be out of order here? Can 3 happen before 1-2?
To answer your question: this is a resounding "YES!". Since the contents of register1 are not tied in any way to the contents of register2, the CPU may happily (and correctly, for that matter) preload register2, so that when the 1-2 loop finally breaks, it can immediately go to 4.
For a practical example, register1 might be an I/O peripheral register tied to a polled serial clock, and the CPU is just waiting for the clock to transition to low, so that it can bit-bang the next value onto the data output lines. Doing it that way for one saves precious time on the data fetch and more importantly may avoid contention on the peripheral data bus.
So, yes, this kind of reordering is perfectly fine and allowed, even with optimizations turned off, and it can happen on a single-threaded, single-core CPU. The only way to make sure that register2 is read only after the loop breaks is to insert a barrier.
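As a sketch of what that looks like in portable C++ (the identifiers mirror the pseudo-code; an acquire load on f is one way to get the required ordering rather than a literal standalone barrier instruction):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> f{0};
int x = 0;   // plain data, published via f

void thread1() {
    while (f.load(std::memory_order_acquire) == 0) { }   // acquire keeps the read of x after the loop
    std::printf("%d\n", x);                              // guaranteed to print 42
}

void thread2() {
    x = 42;
    f.store(1, std::memory_order_release);               // publish x
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
}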
The second question is about cache coherence, and again the answer to whether memory barriers are needed is "yes, you need them". Cache coherence is an issue because modern CPUs don't talk to the system memory directly, but through their caches. As long as you're dealing with only a single CPU core and a single cache, coherence is not an issue, since all the threads running on the same core work against the same cache. However, the moment you have multiple cores with independent caches, their individual views of the system memory contents may differ, and some form of memory consistency model is required - either through explicit insertion of memory barriers, or at the hardware level.
From my point of view you missed the most important thing!
Since the compiler does not see that the change to x or f has any side effect, it can optimize all of that away. The loop with the condition f==0 will also collapse to "nothing": because the compiler only sees the constant f = 0 propagated beforehand, it can assume that f==0 is always true and optimize the loop away.
For all of that, you have to tell the compiler that something will happen which is not visible from the given flow of code. That can be a call to some semaphore/mutex/... or other IPC functionality, or the use of atomic variables.
If you compile your code as written, I assume you get more or less "nothing", since neither code part has any visible effect and the compiler does not see that the variables are used from two thread contexts, so it optimizes everything away.
If we implement the code as in the following example, we see that it fails and prints 0 on my system.
#include <iostream>
#include <thread>

int main()
{
    int f = 0;
    int x = 0;
    std::thread s( [&f,&x](){ x = 42; f = 1; } );   // data race: plain ints shared between threads
    while( f == 0 );                                // may read a stale 0 forever or race on f
    std::cout << x << std::endl;
    s.join();
}
and if we change int f = 0; to std::atomic<int> f{0}; we get the expected result.
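For reference, a fixed version of the snippet might look like the following (this assumes C++11 or later; with the default sequentially consistent operations on f, the loop is no longer optimized away and the read of x is ordered after the load that saw f == 1):

#include <atomic>
#include <iostream>
#include <thread>

int main()
{
    std::atomic<int> f{0};
    int x = 0;
    std::thread s( [&f,&x](){ x = 42; f = 1; } );   // f = 1 is an atomic (seq_cst) store
    while( f == 0 );                                // atomic load on every iteration
    std::cout << x << std::endl;                    // prints 42
    s.join();
}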

Do we need to use a lock on a multi-threaded 32-bit system just for reading or writing a uint32_t variable?

I have a question:
Consider a 32-bit system: does the system read and write a uint32_t variable atomically?
Meaning that the entire read or write operation completes in one instruction.
If this is the case, then on a multi-threaded 32-bit system we won't have to use locks for just reading or writing a uint32_t variable.
Please confirm my understanding.
It is only atomic if you write the code in assembler and pick the appropriate instruction. When using a higher level language, you don't have any control over which instructions will get picked.
If you have some C code like a = b; then the machine code generated might be "load b into register x", "store register x in the memory location of a", which is more than one instruction. An interrupt or another thread executed between those two will mean data corruption if it uses the same variable. Suppose the other thread writes a different value to a - then that change will be lost when returning to the original thread.
Therefore you must use some manner of protection mechanism, such as _Atomic qualifier, mutex or critical sections.
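For example, a hedged C++ sketch of such protection with a mutex (the names shared_value, writer and reader are made up for illustration) could look like this:

#include <cstdint>
#include <mutex>
#include <thread>

uint32_t shared_value = 0;
std::mutex shared_mutex;

void writer(uint32_t b) {
    std::lock_guard<std::mutex> lock(shared_mutex);
    shared_value = b;                 // no other locker can read or modify shared_value mid-update
}

uint32_t reader() {
    std::lock_guard<std::mutex> lock(shared_mutex);
    return shared_value;
}

int main() {
    std::thread t1(writer, 42u);
    std::thread t2([] { (void)reader(); });
    t1.join();
    t2.join();
}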
Yes, one needs to use locks or some other appropriate mechanism, like the atomics.
C11 5.1.2.4p4:
Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.
C11 5.1.2.4p25:
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
Additionally, if you've got a variable that is not volatile-qualified, the C standard does not even require that the changes reach memory at all; unless you use some synchronization mechanism, data races can have much longer spans in an optimized program than one would initially think possible - for example, the writes can be totally out of order, and so forth.
Locks are not (only) used to ensure atomicity; 32-bit variables are already written atomically.
Your problem is to protect against simultaneous modification:
int x = 0;
Function 1: x++;
Function 2: x++;
If there is no synchronization, x might end up as 1 instead of 2, because function 2 might read x == 0 before function 1 modifies x. The worst thing about all this is that it might or might not happen at random (or only on your client's PC), so debugging is difficult.
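A minimal sketch of one fix with std::atomic (fetch_add makes the whole read-modify-write a single indivisible step):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0};

void increment() {
    x.fetch_add(1, std::memory_order_relaxed);   // atomic read-modify-write, no lost update
}

int main() {
    std::thread f1(increment), f2(increment);
    f1.join();
    f2.join();
    assert(x.load() == 2);   // always 2, never 1
}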
The issue is that variables aren't updated instantly.
Each processor core has its own private memory (L1 and L2 caches). So if you modify a variable, say x++, in two different threads on two different cores, then each core updates its own copy of x.
Atomic operations and mutexes ensure synchronization of these variables with the shared memory (RAM / L3 cache).

confused about atomic class: memory_order_relaxed

I am studying this site: https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync, which is very helpful to understand the topic about atomic class.
But this example about relaxed mode is hard to understand:
/* Thread 1: */
y.store (20, memory_order_relaxed);
x.store (10, memory_order_relaxed);

/* Thread 2: */
if (x.load (memory_order_relaxed) == 10)
{
    assert (y.load(memory_order_relaxed) == 20);  /* assert A */
    y.store (10, memory_order_relaxed);
}

/* Thread 3: */
if (y.load (memory_order_relaxed) == 10)
    assert (x.load(memory_order_relaxed) == 10);  /* assert B */
To me, assert B should never fail: if y is 10, then x must be 10 as well, because thread 2 stored y = 10 only after it saw x == 10.
But the website says either assert in this example can actually FAIL.
To me, assert B should never fail: if y is 10, then x must be 10 as well, because thread 2 stored y = 10 only after it saw x == 10.
In effect, your argument is that since in thread 2 the store of 10 into x occurred before the store of 10 into y, in thread 3 the same must be the case.
However, since you are only using relaxed memory operations, there is nothing in the code that requires two different threads to agree on the ordering between modifications of different variables. So indeed thread 2 may see the store of 10 into x before the store of 10 into y while thread 3 sees those two operations in the opposite order.
In order to ensure that assert B succeeds, you would, in effect, need to ensure that when thread 3 sees the value 10 of y, it also sees any other side effects performed by the thread that stored 10 into y before the time of the store. That is, you need the store of 10 into y to synchronize with the load of 10 from y. This can be done by having the store perform a release and the load perform an acquire:
// thread 2
y.store (10, memory_order_release);
// thread 3
if (y.load (memory_order_acquire) == 10)
A release operation synchronizes with an acquire operation that reads the value stored. Now because the store in thread 2 synchronizes with the load in thread 3, anything that happens after the load in thread 3 will see the side effects of anything that happens before the store in thread 2. Hence the assertion will succeed.
Of course, we also need to make sure assertion A succeeds, by making the x.store in thread 1 use release and the x.load in thread 2 use acquire.
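Putting that fix together, a complete sketch of the example with the two release/acquire pairs added (everything else left relaxed) might look like this; with these orderings neither assert can fire:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};

void thread1() {
    y.store(20, std::memory_order_relaxed);
    x.store(10, std::memory_order_release);          // publishes y == 20
}

void thread2() {
    if (x.load(std::memory_order_acquire) == 10) {   // synchronizes with thread1's store
        assert(y.load(std::memory_order_relaxed) == 20);   /* assert A: cannot fail */
        y.store(10, std::memory_order_release);      // publishes everything seen so far
    }
}

void thread3() {
    if (y.load(std::memory_order_acquire) == 10) {   // synchronizes with thread2's store
        assert(x.load(std::memory_order_relaxed) == 10);   /* assert B: cannot fail */
    }
}

int main() {
    std::thread t1(thread1), t2(thread2), t3(thread3);
    t1.join(); t2.join(); t3.join();
}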
I find it much easier to understand atomics with some knowledge of what might be causing these effects, so here is some background. Note that these concepts are in no way stated in the C++ language itself; they are merely some of the possible reasons why things are the way they are.
Compiler reordering
Compilers, especially when optimizing, may reorder and rewrite the program as long as its effects are the same for a single-threaded program. This is circumvented with the use of atomics, which tell the compiler (among other things) that the variable might change at any moment and that its value might be read elsewhere.
Formally, atomics ensures one thing: there will be no data races. That is, accessing the variable will not make your computer explode.
CPU reordering
The CPU might reorder instructions as it executes them, which means instructions can get reordered at the hardware level, independently of how you wrote the program.
Caching
Finally, there are the effects of caches, which are smaller, faster memories that hold a partial copy of global memory. Caches are not always in sync, meaning they don't always agree on what is "correct". Different threads may not be using the same cache, and because of this, they may not agree on what values variables have.
Back to the problem
What the above amounts to is pretty much what C++ says about the matter: unless explicitly stated otherwise, the ordering of the side effects of each operation is completely unspecified. It might not even be the same when viewed from different threads.
Formally, the guarantee of an ordering between side effects is called a happens-before relation. Unless a side effect happens-before another, it is not ordered with respect to it. Loosely, we just call this synchronization.
Now, what is memory_order_relaxed? It tells the compiler to stop meddling, but not to worry about how the CPU and cache (and possibly other things) behave. Therefore, one possible sequence behind the "impossible" assert might be:
Thread 1 stores 20 to y and then 10 to x to its cache.
Thread 2 reads the new values and stores 10 to y to its cache.
Thread 3 didn't read the values from thread 1, but reads those of thread 2, and then the assert fails.
This might be completely different from what happens in reality; the point is that anything can happen.
To ensure a happens-before relation between the multiple reads and writes, see Brian's answer.
Another construct that provides happens-before relations is std::mutex, which is why they are free from such insanities.
The answer to your question is in the C++ standard.
The section [intro.races] is surprisingly clear (which is not the rule for this kind of normative text: the drive for formal consistency often hurts readability).
I have read many books and tutorials that treat the subject of memory ordering, and they just confused me.
Finally I read the C++ standard; the section [intro.multithread] is the clearest material I have found. Taking the time to read it carefully (twice) may save you some time!
The answer to your question is in [intro.races]/4:
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M. [ Note: There is a separate order for each atomic object. There is no requirement that these can be combined into a single total order for all objects. In general this will be impossible since different threads may observe modifications to different objects in inconsistent orders. — end note ]
You were expecting a single total order on all atomic operations. There is such an order, but only for atomic operations that are memory_order_seq_cst as explained in [atomics.order]/3:
There shall be a single total order S on all memory_order_seq_cst operations, consistent with the "happens before" order and modification orders for all affected locations [...]
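In other words, if you make every access memory_order_seq_cst (the default), all of the operations fall into that single total order S, every thread agrees on the order of the stores to x and y, and both asserts hold. A short sketch of that variant, using the default ordering of the plain operators:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};   // plain assignments/reads on std::atomic default to memory_order_seq_cst

int main() {
    std::thread t1([] { y = 20; x = 10; });
    std::thread t2([] { if (x == 10) { assert(y == 20); y = 10; } });   // assert A holds
    std::thread t3([] { if (y == 10) assert(x == 10); });               // assert B holds
    t1.join(); t2.join(); t3.join();
}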

Why is there no data race?

I am reading Bjarne's FAQ on the memory model; here is a quote:
So, C++11 guarantees that no such problems occur for "separate memory locations.'' More precisely: A memory location cannot be safely accessed by two threads without some form of locking unless they are both read accesses. Note that different bitfields within a single word are not separate memory locations, so don't share structs with bitfields among threads without some form of locking. Apart from that caveat, the C++ memory model is simply "as everyone would expect.''
However, it is not always easy to think straight about low-level concurrency issues. Consider:
start with x==0 and y==0
if (x) y = 1; // Thread 1
if (y) x = 1; // Thread 2
Is there a problem here? More precisely, is there a data race? (No there isn't).
My question is: why is there no data race? It is obvious to me that there apparently is a data race, since thread 1 is a writer of y while thread 2 is a reader of y, and similarly for x.
x and y are 0, so the code behind the if will not be executed; there is no write, and therefore there can be no data race.
The critical point is:
start with x==0 and y==0
Since both variables are set to 0 when it starts, the if tests will fail, and assignments will never occur. So both threads are only reading the variables, never writing them.

Acquire/Release versus Sequentially Consistent memory order

For any std::atomic<T> where T is a primitive type:
If I use std::memory_order_acq_rel for fetch_xxx operations, std::memory_order_acquire for load operations and std::memory_order_release for store operations blindly (I mean, just as if I were resetting the default memory ordering of those functions),
will the results be the same as if I had used std::memory_order_seq_cst (which is used as the default) for all of the declared operations?
If the results are the same, is this usage any different from using std::memory_order_seq_cst in terms of efficiency?
The C++11 memory ordering parameters for atomic operations specify constraints on the ordering. If you do a store with std::memory_order_release, and a load from another thread reads the value with std::memory_order_acquire then subsequent read operations from the second thread will see any values stored to any memory location by the first thread that were prior to the store-release, or a later store to any of those memory locations.
If both the store and subsequent load are std::memory_order_seq_cst then the relationship between these two threads is the same. You need more threads to see the difference.
e.g. std::atomic<int> variables x and y, both initially 0.
Thread 1:
x.store(1,std::memory_order_release);
Thread 2:
y.store(1,std::memory_order_release);
Thread 3:
int a=x.load(std::memory_order_acquire); // x before y
int b=y.load(std::memory_order_acquire);
Thread 4:
int c=y.load(std::memory_order_acquire); // y before x
int d=x.load(std::memory_order_acquire);
As written, there is no relationship between the stores to x and y, so it is quite possible to see a==1, b==0 in thread 3, and c==1 and d==0 in thread 4.
If all the memory orderings are changed to std::memory_order_seq_cst then this enforces an ordering between the stores to x and y. Consequently, if thread 3 sees a==1 and b==0 then that means the store to x must be before the store to y, so if thread 4 sees c==1, meaning the store to y has completed, then the store to x must also have completed, so we must have d==1.
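Here is a compilable sketch of that four-thread example with everything seq_cst; under this ordering the "disagreeing" outcome that was possible with acquire/release is forbidden (a==1 && b==0 && c==1 && d==0 can no longer occur):

#include <atomic>
#include <cassert>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int a, b, c, d;

int main() {
    std::thread t1([] { x.store(1, std::memory_order_seq_cst); });
    std::thread t2([] { y.store(1, std::memory_order_seq_cst); });
    std::thread t3([] { a = x.load(std::memory_order_seq_cst);     // reads x before y
                        b = y.load(std::memory_order_seq_cst); });
    std::thread t4([] { c = y.load(std::memory_order_seq_cst);     // reads y before x
                        d = x.load(std::memory_order_seq_cst); });
    t1.join(); t2.join(); t3.join(); t4.join();
    assert(!(a == 1 && b == 0 && c == 1 && d == 0));   // threads 3 and 4 agree on one store order
    std::printf("a=%d b=%d c=%d d=%d\n", a, b, c, d);
}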
In practice, using std::memory_order_seq_cst everywhere will add additional overhead to either loads or stores or both, depending on your compiler and processor architecture. e.g. a common technique for x86 processors is to use XCHG instructions rather than MOV instructions for std::memory_order_seq_cst stores, in order to provide the necessary ordering guarantees, whereas for std::memory_order_release a plain MOV will suffice. On systems with more relaxed memory architectures the overhead may be greater, since plain loads and stores have fewer guarantees.
Memory ordering is hard. I devoted almost an entire chapter to it in my book.
Memory ordering can be quite tricky, and the effects of getting it wrong are often very subtle.
The key point with all memory ordering is that it guarantees what "HAS HAPPENED", not what is going to happen. For example, if you store something to a couple of variables (e.g. x = 7; y = 11;), then another processor may be able to see y as 11 before it sees the value 7 in x. By using a memory ordering operation between setting x and setting y, the processor you are using will guarantee that x = 7 has been written to memory before it continues to store something in y.
Most of the time, it's not REALLY important which order your writes happen, as long as the value is updated eventually. But if we, say, have a circular buffer with integers, and we do something like:
buffer[index] = 32;
index = (index + 1) % buffersize;
and some other thread is using index to determine that the new value has been written, then we NEED to have 32 written FIRST, then index updated AFTER. Otherwise, the other thread may get old data.
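A hedged sketch of that pattern with C++ atomics, for a single producer and single consumer (the names and the missing "buffer full" check are simplifications, not a production ring buffer):

#include <atomic>
#include <cstddef>

constexpr std::size_t buffersize = 16;
int buffer[buffersize];
std::atomic<std::size_t> write_index{0};   // plays the role of "index" above

void producer_put(int value) {
    std::size_t i = write_index.load(std::memory_order_relaxed);
    buffer[i] = value;                                     // write the data FIRST
    write_index.store((i + 1) % buffersize,
                      std::memory_order_release);          // THEN publish the new index
}

bool consumer_get(std::size_t& last_seen, int& out) {
    std::size_t i = write_index.load(std::memory_order_acquire);   // seeing the index implies seeing the data
    if (i == last_seen) return false;                              // nothing new yet
    out = buffer[last_seen];
    last_seen = (last_seen + 1) % buffersize;
    return true;
}

int main() {
    std::size_t last_seen = 0;
    int value = 0;
    producer_put(32);
    bool got = consumer_get(last_seen, value);   // got == true, value == 32
    return got && value == 32 ? 0 : 1;
}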
The same applies to making semaphores, mutexes and such things work - this is why the terms release and acquire are used for the memory barrier types.
Now, seq_cst is the strictest ordering rule - it enforces that both reads and writes of the data you've written go out to memory before the processor can continue with more operations. This will be slower than using the specific acquire or release barriers. It forces the processor to make sure stores AND loads have been completed, as opposed to just stores or just loads.
How much difference does that make? It is highly dependent on the system architecture. On some systems, the cache needs to be flushed [partially] and interrupts sent from one core to another to say "please do this cache-flushing work before you continue" - this can take several hundred cycles. On other processors, it's only some small percentage slower than doing a regular memory write. x86 is pretty good at doing this fast. Some types of embedded processors - some models of ARM, for example (not sure which) - require a bit more work in the processor to ensure everything works.