I am reading Bjarne Stroustrup's C++11 FAQ on the memory model; here is a quote:
So, C++11 guarantees that no such problems occur for "separate memory locations." More precisely: A memory location cannot be safely accessed by two threads without some form of locking unless they are both read accesses. Note that different bitfields within a single word are not separate memory locations, so don't share structs with bitfields among threads without some form of locking. Apart from that caveat, the C++ memory model is simply "as everyone would expect."
However, it is not always easy to think straight about low-level concurrency issues. Consider:
// start with x==0 and y==0
if (x) y = 1; // Thread 1
if (y) x = 1; // Thread 2
Is there a problem here? More precisely, is there a data race? (No there isn't).
My question is: why is there no data race? It seems obvious to me that there is one, since thread 1 writes y while thread 2 reads y, and similarly for x.
x and y are 0, so the code behind each if will never execute. There is no write, and therefore there can be no data race.
The critical point is:
start with x==0 and y==0
Since both variables are 0 at the start, the if tests fail and the assignments never occur. So both threads are only reading the variables, never writing them.
I have a question:
Consider a 32-bit system:
does the system read and write a uint32_t variable atomically?
Meaning, can the entire read or write operation be completed in one instruction cycle?
If this is the case, then on a multi-threaded 32-bit system we won't have to use locks just for reading or writing a uint32_t variable.
Please confirm my understanding.
It is only atomic if you write the code in assembler and pick the appropriate instruction. When using a higher-level language, you don't have any control over which instructions will get picked.
If you have some C code like a = b; then the machine code generated might be "load b into register x", "store register x in the memory location of a", which is more than one instruction. An interrupt, or another thread executing between those two instructions, will mean data corruption if it uses the same variable. Suppose the other thread writes a different value to a - then that change will be lost when returning to the original thread.
Therefore you must use some manner of protection mechanism, such as the _Atomic qualifier, mutexes, or critical sections.
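For illustration, here is a minimal C++ sketch of the same idea; the answer above mentions C's _Atomic, and std::atomic is the closest C++ counterpart (the writer/reader names are just for this example):

#include <atomic>
#include <thread>

std::atomic<int> a{0};  // atomic: each load/store is one indivisible operation
int b = 42;

void writer() { a.store(b); }  // cannot tear or be split into a visible load+store
void reader() { int seen = a.load(); (void)seen; }  // a single atomic load

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}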
Yes, one needs to use locks or some other appropriate mechanism, such as atomics.
C11 5.1.2.4p4:
Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.
C11 5.1.2.4p25:
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
Additionally, if a variable is not volatile-qualified, the C standard does not even require that the changes hit memory at all; unless you use some synchronization mechanism, data races in an optimized program can have much longer spans than one might initially think possible - for example, writes can happen totally out of order, and so forth.
Locks are not (only) there to ensure atomicity; on most 32-bit platforms, aligned 32-bit variables are already written atomically.
Your problem is to protect against simultaneous writes:
int x = 0;
Function 1: x++;
Function 2: x++;
If there is no synchronization, x might end up as 1 instead of 2, because function 2 might read x == 0 before function 1 writes its result back. The worst thing about all this is that it might happen or not at random (or only on your client's PC), so debugging is difficult.
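As a hedged sketch of the fix (assuming C++ and std::atomic; the answer itself doesn't prescribe a specific mechanism), an atomic read-modify-write makes the lost update impossible:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0};

void increment() {
    x.fetch_add(1);  // the read-modify-write happens as one atomic step
}

int main() {
    std::thread t1(increment), t2(increment);
    t1.join();
    t2.join();
    assert(x.load() == 2);  // never 1: no increment can be lost
}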
The issue is that variables aren't updated instantly.
Each processor core has its own private caches (L1 and L2). So if you modify a variable, say via x++, in two different threads on two different cores, each core may temporarily work with its own view of x (on modern hardware the caches themselves are kept coherent, but store buffers and reordering produce much the same visible effect).
Atomic operations and mutexes ensure that these updates are propagated to, and observed from, shared memory (RAM / L3 cache) in a well-defined order.
I am studying this site: https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync, which is very helpful for understanding the atomic classes.
But this example about relaxed mode is hard to understand:
/* Thread 1: */
y.store (10, memory_order_relaxed);
x.store (10, memory_order_relaxed);
/* Thread 2: */
if (x.load (memory_order_relaxed) == 10)
{
    assert (y.load (memory_order_relaxed) == 20); /* assert A */
    y.store (10, memory_order_relaxed);
}
/* Thread 3: */
if (y.load (memory_order_relaxed) == 10)
    assert (x.load (memory_order_relaxed) == 10); /* assert B */
To me, assert B should never fail, since x must be 10 if y == 10, because thread 2 only stores 10 into y after it has seen x == 10.
But the website says either assert in this example can actually FAIL.
To me, assert B should never fail, since x must be 10 if y == 10, because thread 2 only stores 10 into y after it has seen x == 10.
In effect, your argument is that since in thread 2 the store of 10 into x occurred before the store of 10 into y, in thread 3 the same must be the case.
However, since you are only using relaxed memory operations, there is nothing in the code that requires two different threads to agree on the ordering between modifications of different variables. So indeed thread 2 may see the store of 10 into x before the store of 10 into y while thread 3 sees those two operations in the opposite order.
In order to ensure that assert B succeeds, you would, in effect, need to ensure that when thread 3 sees the value 10 of y, it also sees any other side effects performed by the thread that stored 10 into y before the time of the store. That is, you need the store of 10 into y to synchronize with the load of 10 from y. This can be done by having the store perform a release and the load perform an acquire:
// thread 2
y.store (10, memory_order_release);
// thread 3
if (y.load (memory_order_acquire) == 10)
A release operation synchronizes with an acquire operation that reads the value stored. Now because the store in thread 2 synchronizes with the load in thread 3, anything that happens after the load in thread 3 will see the side effects of anything that happens before the store in thread 2. Hence the assertion will succeed.
Of course, we also need to make sure assertion A succeeds, by making the x.store in thread 1 use release and the x.load in thread 2 use acquire.
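Putting both fixes together, a complete, runnable sketch of the three threads might look like this (an illustrative rewrite of the wiki example, not the wiki's own code):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};

void thread1() {
    y.store(20, std::memory_order_relaxed);
    x.store(10, std::memory_order_release);  // release: publishes the store to y
}

void thread2() {
    if (x.load(std::memory_order_acquire) == 10) {      // acquire: sees thread 1's store to y
        assert(y.load(std::memory_order_relaxed) == 20); // assert A: cannot fail now
        y.store(10, std::memory_order_release);           // release: publishes x == 10
    }
}

void thread3() {
    if (y.load(std::memory_order_acquire) == 10)          // acquire: synchronizes with thread 2
        assert(x.load(std::memory_order_relaxed) == 10);  // assert B: cannot fail now
}

int main() {
    std::thread t1(thread1), t2(thread2), t3(thread3);
    t1.join();
    t2.join();
    t3.join();
}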
I find it much easier to understand atomics with some knowledge of what motivates them, so here's some background knowledge. Know that these concepts are in no way stated in the C++ language itself, but are some of the possible reasons why things are the way they are.
Compiler reordering
Compilers, often when optimizing, will choose to reorder and rewrite the program as long as its effects are the same on a single-threaded program. This is prevented by the use of atomics, which tell the compiler (among other things) that the variable might change at any moment and that its value might be read elsewhere.
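A classic illustration of this (a hypothetical sketch, not from the original answer): without atomics, the compiler may hoist a flag load out of a loop, turning a busy-wait into an infinite loop; with atomics, it must re-read the flag.

#include <atomic>

bool plain_stop = false;               // non-atomic: racy, and the compiler may
                                       // hoist the load out of the loop entirely
std::atomic<bool> atomic_stop{false};  // atomic: must be re-read every iteration

void spin_plain() {
    while (!plain_stop) { }  // may be compiled as: if (!plain_stop) for (;;) {}
}

void spin_atomic() {
    while (!atomic_stop.load(std::memory_order_relaxed)) { }  // a real load each time
}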
Formally, atomics ensure one thing: there will be no data races. That is, accessing the variable will not make your computer explode.
CPU reordering
The CPU might reorder instructions as it is executing them, which means instructions may get reordered at the hardware level, independent of how you wrote the program.
Caching
Finally there are the effects of caches, which are faster memories that hold a partial copy of global memory. Caches are not always in sync, meaning they don't always agree on what is "correct". Different threads may not be using the same cache, and because of this, they may not agree on what values variables have.
Back to the problem
What the above amounts to is pretty much what C++ says about the matter: unless explicitly stated otherwise, the ordering of the side effects of each instruction is totally and completely unspecified. It might not even be the same when viewed from different threads.
Formally, the guarantee of an ordering between side effects is called a happens-before relation. Unless one side effect happens-before another, they are not ordered at all. Loosely, we just call it synchronization.
Now, what is memory_order_relaxed? It tells the compiler to stop meddling with this access, but makes no promises about how the CPU and caches (and possibly other things) behave. Therefore, one possibility for why you see the "impossible" assert might be:
Thread 1 stores 20 to y and then 10 to x to its cache.
Thread 2 reads the new values and stores 10 to y to its cache.
Thread 3 doesn't see the values from thread 1, but does see thread 2's store of 10 to y, and then assert B fails.
This might be completely different from what happens in reality; the point is that anything can happen.
To ensure a happens-before relation between the multiple reads and writes, see Brian's answer.
Another construct that provides happens-before relations is std::mutex, which is why code protected by a mutex is free from such insanities.
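For instance, a minimal sketch of the same two-variable handoff protected by std::mutex (names and structure are illustrative):

#include <cassert>
#include <mutex>
#include <thread>

std::mutex m;
int x = 0, y = 0;

void writer() {
    std::lock_guard<std::mutex> lock(m);
    y = 20;
    x = 10;
}   // unlocking m happens-before the next lock of m

void reader() {
    std::lock_guard<std::mutex> lock(m);
    if (x == 10)
        assert(y == 20);  // guaranteed: the writer's critical section happened-before ours
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}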
The answer to your question is in the C++ standard.
The section [intro.races] is surprisingly clear (which is not the rule for normative text: the consistency demanded by formalism often hurts readability).
I have read many books and tutorials that treat the subject of memory order, but they just confused me.
Finally I read the C++ standard; the section [intro.multithread] is the clearest text I have found. Taking the time to read it carefully (twice!) may save you a lot of time.
The answer to your question is in [intro.races]/4:
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M. [ Note: There is a separate order for each atomic object. There is no requirement that these can be combined into a single total order for all objects. In general this will be impossible since different threads may observe modifications to different objects in inconsistent orders. — end note ]
You were expecting a single total order on all atomic operations. There is such an order, but only for atomic operations that are memory_order_seq_cst as explained in [atomics.order]/3:
There shall be a single total order S on all memory_order_seq_cst operations, consistent with the "happens before" order and modification orders for all affected locations [...]
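To see what the single total order S buys you, here is the classic store-buffering example as a hedged sketch (not from the original answer): with seq_cst, the outcome r1 == 0 && r2 == 0 is impossible, while relaxed (or even acquire/release) operations would allow it.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void t1() {
    x.store(1, std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_seq_cst);
}

void t2() {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
}

int main() {
    std::thread a(t1), b(t2);
    a.join();
    b.join();
    // In the single total order S, at least one store precedes the other
    // thread's load, so both loads cannot see 0.
    assert(r1 == 1 || r2 == 1);
}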
I have tried to establish an invariant between two atomic counters in one thread while ensuring that this invariant was maintained during reads on another thread, without using a mutex.
Nevertheless, looking at the code, it appears that I just implemented some sort of locking algorithm using the two atomic counters (with the risk that I messed it up).
Is it possible to share an invariant between threads without using a locking strategy?
EDIT: The term invariant may not be adequate.
Let's say I have two variables a and b. At some point in the program's execution, a thread A sets a and b to some distinct values. After that, I would like the following: if a thread B loads the value of a stored by A and then loads b, the loaded value of b is the one stored by A and not a value stored into b later.
Is it possible to share an invariant between threads without using a locking strategy?
No.
Why? Imagine the following scenario: threads X and Y, counters a and b.
Thread X writes into a and b, which temporarily breaks the invariant. Now suppose thread Y attempts to read a and b while the invariant is broken and without any locking strategy.
In this case thread Y may read stale (or too fresh) data, since a may be the value that X just wrote, or the previous one. The same holds for b.
This is a classic multithreading bug, which results in data races.
It is similar to the doubly-linked list example described here. In that example, threads X and Y manipulate a doubly-linked list. One of the invariants is that if we follow the next pointer from one node (1) to another (2), the prev pointer of that node (2) points back to the first node (1).
Suppose thread X deletes a node of the list while thread Y attempts to read the list at that point; this can cause problems, because there is a window in which the pointers of the deleted node's neighbours are not yet updated (thus the invariant is broken at the moment thread Y traverses the list).
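A minimal sketch of that window (hypothetical node structure and names; only to show where the invariant breaks):

struct Node {
    Node* prev;
    Node* next;
};

// Unsynchronized removal of `n` from a doubly-linked list.
void unlink(Node* n) {
    n->prev->next = n->next;  // invariant broken here:
                              // n->next->prev still points at n
    n->next->prev = n->prev;  // invariant restored
}
// A reader traversing the list between the two assignments sees
// next/prev links that disagree -- hence the need for locking.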
A data race occurs when two threads access the same variable concurrently and at least one of the accesses is a write.
https://isocpp.org/wiki/faq/cpp11-language-concurrency
// start with x==0 and y==0
if (x) y = 1; // Thread 1
if (y) x = 1; // Thread 2
Is there a problem here? More precisely, is there a data race? (No there isn't).
Why does the original article claim that there is no data race here?
Neither thread will be writing since neither variable is non-zero before the conditionals.
Data races are not static properties of your code. They are properties of the actual state of the program at execution time. So while that program could be in a state where the code would produce a data race, that's not the question.
The question is, given the state of the system, will the code cause a data race? And since the program is in a state such that neither thread will write to either variable, then the code will not cause a data race.
Data races aren't about what your code might do; they're about what it will do. Just as a function that takes a pointer isn't undefined behavior merely because it uses the pointer without checking for NULL: it is only UB if someone passes a pointer that really is NULL.
Because x and y are both zero, the abstract machine defined by the C++ standard can't write to either memory location, so the only way this could be a problem is if the implementation decided to write to the memory location anyway. For example, if it transformed
if (x) y = 1;
into
y = 1;
if (!x) y = 0;
This is a potentially valid rewrite under the as-if rule since the observed behavior by any one thread is the same (C++14 1.9 [intro.execution])
The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.
This would in fact have been a valid rewrite prior to C++11, but since C++11, threads of execution are considered. Because of this, the implementation is not allowed to make changes that would have different observed behavior across threads as long as no data race occurs in the abstract machine.
There's a special note in the C++14 standard that applies here (C++14 1.10 [intro.multithread] paragraph 22)
[ Note: Compiler transformations that introduce assignments to a potentially shared memory location that would not be modified by the abstract machine are generally precluded by this standard, since such an assignment might overwrite another assignment by a different thread in cases in which an abstract machine execution would not have encountered a data race.
...
Because of this, the rewrite isn't valid. The implementation has to preserve the observed behavior that x and y are not modified, even across threads. Therefore, there is no data race.
I found this article written by Hans-J. Boehm illuminating:
http://www.hpl.hp.com/techreports/2009/HPL-2009-259html.html#races
We say that two ordinary memory operations conflict if they access the same memory location (for example, a variable or array element), and at least one of them writes to the location.

We say that a program allows a data race on a particular set of inputs if there is a sequentially consistent execution, that is an interleaving of operations of the individual threads, in which two conflicting operations can be executed "simultaneously". For our purposes, two such operations can be executed "simultaneously" if they occur next to each other in the interleaving and correspond to different threads.
And the article goes on to our point:
Our definition of data race is rather stringent: there must be an actual way to execute the original, untransformed program such that conflicting operations occur in parallel. This imposes a burden on compilers not to "break" programs by introducing harmful data races.
As stated in the article, which reports the same example (and others):
There's no sequentially consistent execution of this program in which Thread 1 assigns to y, since x and y never become nonzero. Indeed, the condition is never satisfied, so neither thread ever writes to the variable that the other thread might be reading.
To understand the difference with the case where a data race exists, try to think about the following example in the article:
y = ((x != 0) ? 1 : y); // Thread 1
y = 2;                  // Thread 2
In this last case it is clear that y can be assigned (written) by Thread 1 while Thread 2 executes y = 2; (y is written by Thread 1 no matter what). A data race can happen.
If x is not set, setting y to 1 doesn't happen and vice versa. So, things here are indeed happening sequentially.
I'm trying to learn the basics about low-level concurrency.
From the Linux kernel documentation:
A write memory barrier gives a guarantee that all the STORE operations specified before the barrier will appear to happen before all the STORE operations specified after the barrier with respect to the other components of the system.
I think "all the STORE operations" must mean that there can be more than one instance of a particular barrier type, and that there is probably a 1:N relationship between a barrier instance and STOREs. Where can I find confirmation of this?
Memory barriers are not related to any specific memory locations.
It's not about "the write to memory address x should happen before the write to address y"; it's about the execution order of instructions. E.g., for the program
x = 2
y = 1
the processor may decide: "I don't want to wait until 2 is finally stored in x; I can start writing 1 to y while x = 2 is still in progress" (also known as out-of-order execution, or reordering). So a reader on another core may observe 0 in x (its initial value) after observing 1 in y, which is counter-intuitive behaviour.
If you place a write barrier between the two stores, then the reader can be sure that if it observes the result of the second store, the first one has also happened: if it reads y == 1, then it knows that x == 2. (It's not quite that easy, though, because reads can be executed out of order too, so you also need a read barrier on the reading side.) In other words, such a barrier forbids making y = 1 visible while x = 2 is not finished.
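In C++ terms, the same pairing can be expressed with explicit fences; a hedged sketch using std::atomic_thread_fence (illustrative, not from the Linux documentation):

#include <atomic>
#include <cassert>

std::atomic<int> x{0}, y{0};

void writer() {
    x.store(2, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);  // the "write barrier"
    y.store(1, std::memory_order_relaxed);
}

void reader() {
    if (y.load(std::memory_order_relaxed) == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);  // the "read barrier"
        assert(x.load(std::memory_order_relaxed) == 2);  // guaranteed by the paired fences
    }
}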
As @RafaelWinterhalter mentioned, there is an awesome guide for JVM compiler writers, which has many concrete examples of how barriers are mapped to real code.
For additional reading, see Preshing's blog; it has many articles about low-level concurrency, e.g. this one about barriers.