I was reading Bjarne Stroustrup's C++11 FAQ and I'm having trouble understanding an example in the memory model section.
He gives the following code snippet:
// start with x==0 and y==0
if (x) y = 1; // thread 1
if (y) x = 1; // thread 2
The FAQ says there is not a data race here. I don't understand. The memory location x is read by thread 1 and written to by thread 2 without any synchronization (and the same goes for y). That's two accesses, one of which is a write. Isn't that the definition of a data race?
Further, it says that "every current C++ compiler (that I know of) gives the one right answer." What is this one right answer? Couldn't the answer vary depending on whether one thread's comparison happens before or after the other thread's write (or if the other thread's write is even visible to the reading thread)?
// start with x==0 and y==0
if (x) y = 1; // thread 1
if (y) x = 1; // thread 2
Since neither x nor y starts out true, neither one will ever be set to true. No matter what order the instructions execute in, the (correct) result is always the same: x remains 0 and y remains 0.
The memory location x is ... written to by thread 2
Is it really? Why do you say so?
If y is 0 then x is not written to by thread 2. And y starts out 0. Similarly, x cannot be non-zero unless somehow y is non-zero "before" thread 1 runs, and that cannot happen. The general point here is that conditional writes that don't execute don't cause a data race.
This is a non-trivial fact of the memory model, though, because a compiler that is not aware of threading would be permitted (assuming y is not volatile) to transform the code if (x) y = 1; to int tmp = y; y = 1; if (!x) y = tmp;. Then there would be a data race. I can't imagine why a compiler would want to do that exact transformation, but that doesn't matter; the point is that optimizers for non-threaded environments can do things that would violate the threaded memory model. So when Stroustrup says that every compiler he knows of gives the right answer (right under C++11's threading model, that is), that's a non-trivial statement about the readiness of those compilers for C++11 threading.
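Spelled out, the hypothetical transformed version of thread 1 would look like this (my expansion of the snippet above, with comments marking where the races appear):

// Hypothetical (and illegal under C++11) single-thread optimization of thread 1:
int tmp = y;      // unconditional read of y  -- races with thread 2's write
y = 1;            // unconditional speculative write -- races with thread 2's read
if (!x) y = tmp;  // undo the write if the condition turned out false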
A more realistic transformation of if (x) y = 1 would be y = x ? 1 : y;. I believe that this would cause a data race in your example, and that there is no special treatment in the standard for the assignment y = y that makes it safe to execute unsequenced with respect to a read of y in another thread. You might find it hard to imagine hardware on which it doesn't work, and anyway I may be wrong, which is why I used a different example above that's less realistic but has a blatant data race.
There has to be a total ordering of the writes, because no thread can write to x or y until some other code has first written a nonzero value to one of them. In other words, you have basically three different scenarios:
thread 1 gets to write to y because a 1 was written to x at some point before the if statements run; if thread 2 then comes along later, it writes the same value of 1 to x, leaving x's previous value of 1 unchanged.
thread 2 gets to write to x because y was changed at some point before the if statements run; if thread 1 comes along later, it likewise writes the same value of 1 to y.
if only the two threads shown ever touch x and y, the if bodies are jumped over, because x and y remain 0.
Neither of the writes occurs, so there is no race. Both x and y remain zero.
(This is talking about the problem of phantom writes. Suppose one thread speculatively did the write before checking the condition, then attempted to correct things after. That would break the other thread, so it isn't allowed.)
The memory model sets the supported size of the code and data areas. Before compiling and linking source code, we need to specify the memory model, which sets the size limits for the data and code.
I am quite new to C++ atomics and memory ordering and cannot wrap my head around one point. Consider the example taken directly from here: https://preshing.com/20120612/an-introduction-to-lock-free-programming/#sequential-consistency
std::atomic<int> X(0), Y(0);
int r1, r2;

void thread1()
{
    X.store(1);
    r1 = Y.load();
}

void thread2()
{
    Y.store(1);
    r2 = X.load();
}
There are sixteen possible outcomes in the total memory order:
thread 1 store -> thread 1 load -> thread 2 store -> thread 2 load
thread 1 store -> thread 2 store -> thread 1 load -> thread 2 load
...
Does a sequentially consistent program guarantee that if a particular store operation on some atomic variable happens before a load operation performed on the same atomic variable (but in another thread), the load will always see the latest value stored (i.e., the second point on the list above, where two stores happen before two loads in the total order)? In other words, if one put assert(r1 != 0 && r2 != 0) later in the program, would it be possible for the assert to fire? According to the article such a situation cannot take place. However, there is a quote taken from another thread where Anthony Williams commented on that: Concurrency: Atomic and volatile in C++11 memory model
"The default memory ordering of std::memory_order_seq_cst provides a single global total order for all std::memory_order_seq_cst operations across all variables. This doesn't mean that you can't get stale values, but it does mean that the value you do get determines and is determined by where in this total order your operation lies.
Who is right, who is wrong? Or maybe it's only my misunderstanding and both answers are correct.
All the statements you quoted are correct. I think the confusion is coming from ambiguity in terms like "latest" and "stale", which could refer either to the sequentially consistent total order, or to ordering in real time. Those two orders do not have to be consistent with each other, and only the former is relevant to describing the program's observable behavior.
Let's start by looking at your program and then come back to the terminology afterwards.
There are sixteen possible outcomes in the total memory order:
No, there are only six. Let's call the operations XS, XL, YS, YL for the stores and loads to X and Y respectively. The total order has to be consistent with the sequencing (program order) of each thread, hence the name "sequential consistency". So XS has to precede YL in the total order, and YS has to precede XL.
Does a sequentially consistent program guarantee that if a particular store operation on some atomic variable happens before a load operation performed on the same atomic variable (but in another thread), the load will always see the latest value stored (i.e., the second point on the list above, where two stores happen before two loads in the total order)?
Careful, let's not use the phrase "happens before", as that refers to a different partial order in the memory model, which we do not have to consider when interested only in ordering of seq_cst atomic operations.
Sequential consistency does guarantee reading the "latest" value stored, where by "latest" we mean with respect to the total order. To be precise, each load L of a variable X takes its value from the unique store S to X which satisfies the following conditions: S precedes L, and no other store to X falls between S and L (that is, S is the most recent store to X before L in the total order). So in your program, XL will return 1 if it follows XS in the total order, otherwise it will return 0.
Thus here are the possible total orders, and the corresponding values returned by XL and YL (your r2 and r1 respectively):
XS, YL, YS, XL: here XL == 1 and YL == 0.
XS, YS, YL, XL: here XL == 1 and YL == 1.
XS, YS, XL, YL: here XL == 1 and YL == 1.
YS, XS, YL, XL: here XL == 1 and YL == 1.
YS, XS, XL, YL: here XL == 1 and YL == 1.
YS, XL, XS, YL: here XL == 0 and YL == 1.
Note there are no orderings resulting in XL == 0 and YL == 0. That would require XL to precede XS, and YL to precede YS. But program order already requires that XS precedes YL and YS precedes XL. That would make a cycle, which by definition of a total order is not allowed.
In other words, if one put assert(r1 != 0 && r2 != 0) later in the program, would it be possible for the assert to fire? According to the article such a situation cannot take place.
I think you misread Preshing's article, or maybe you just have a typo in your question. Preshing is saying that r1 and r2 cannot both be zero, i.e., that assert(r1 != 0 || r2 != 0) would not fire. That is absolutely correct. But your assertion with && certainly could fire, in the case of orders 1 or 6 above.
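To make this concrete, here is a self-contained version of your program (main() and the thread plumbing are my additions) showing the assertion that cannot fire:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> X(0), Y(0);
int r1, r2;

void thread1() { X.store(1); r1 = Y.load(); }
void thread2() { Y.store(1); r2 = X.load(); }

int main()
{
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    // seq_cst (the default ordering) forbids both loads returning 0:
    assert(r1 != 0 || r2 != 0);  // can never fire
    // assert(r1 != 0 && r2 != 0) *can* fire, via orders 1 and 6 above.
    return 0;
}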
"This doesn't mean that you can't get stale values, but it does mean that the value you do get determines and is determined by where in this total order your operation lies." [Anthony Williams]
Here Anthony means "stale" in the sense of real time. For instance, it is quite possible that XS executes at time 12:00:00.0000001 and XL executes at time 12:00:00.0000002, but XL still loads the value 0. There can be real-time "lag" before an operation becomes globally visible.
But if this happens, it means we are in a total ordering in which XL precedes XS. That makes the total ordering inconsistent with wall clock time, but that is allowed. What cannot happen is for such "lag" to reverse the ordering of visibility for two operations from the same thread. In this example, the machine might have to delay the execution of YL until time 12:00:00.0000003 so that it does not become visible before XS. The compiler would be responsible for inserting appropriate barrier instructions to ensure that this will happen.
(This sets aside the fact that on a modern CPU, it doesn't even make sense to talk about the "time" at which an instruction executes. An instruction can execute in several stages spanning many clock cycles, and even within a single core, this may be happening for several instructions at once. The machine is required to preserve the illusion of program order for the core observing its own operations, but not necessarily when they are observed by other cores.)
Because of the total order, it is actually valid to treat all seq_cst operations as happening at distinct ticks of some global "clock", where visibility and causality are preserved with respect to this clock. It's just that this clock may not always be running forwards in time with respect to the clock on your wall.
Ever since I started multi-threading, I've been asking myself this one question:
Is writing and reading a variable from different threads undefined behavior?
Let's use the minimal example where we increment an integer in a thread and read the integer inside another one.
int x = 0; // shared between the two threads

void thread1()
{
    x++;
}

void thread2()
{
    if (x == 5)
    {
        // doSomething
    }
}
I understand that the addition operation is not atomic and therefore I could make a read from the second thread while the first thread is in the middle of the adding operation, but there is something I'm not quite sure of.
Does x keep its value until the whole addition operation is complete and only then get assigned the new value, or does x have an intermediate state where reading from it would result in undefined behavior?
If the first theory applies, then reading from x while it's being written to would simply return the value before the addition and wouldn't be so problematic.
If the second theory is true, could someone explain in more detail what the process of the addition operation is and why it would be undefined behavior (maybe with an example)?
Thanks
The comments already got the basics right.
The compiler, when compiling a single function, may consider the ways in which a variable is changed. If the function cannot directly or indirectly change a certain variable, then the compiler may assume that there is no change to that variable whatsoever, unless there's thread synchronization, in which case the compiler must deal with the possibility of another thread changing those variables.
If the compiler's assumption is violated (i.e. you have a bug), then literally anything may happen. This is not constrained, because constraining it would severely restrict optimizers. You may assume that x has some unique address in memory, but optimizers are known to move variables around and to have multiple variables share a single address (just at different times). Such optimizations may very well be justified based on a single-thread assumption, one that your example is violating. Your second thread may think it's looking at x, but it might also be getting y.
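To illustrate, suppose thread2 polled x in a loop instead of checking it once (a variation of the question's code; the "optimized" version is a sketch of what the compiler may legally do, not the actual output of any particular compiler):

int x = 0; // shared, but the compiler has no way to know that

// What you wrote:
void thread2()
{
    while (x != 5) { }  // wait until thread1 gets x to 5
    // doSomething
}

// What the single-thread assumption allows the optimizer to produce:
void thread2_optimized()
{
    int cached = x;           // x is loaded exactly once
    while (cached != 5) { }   // spins forever unless x was already 5
    // doSomething
}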
x (a 32-bit variable) will always hold a defined value on a 32-bit (or wider) CPU, though not a precisely predictable one. You only know that x can be any value from its starting value up to the end of the range produced by the ++ operations.
For example: if x is initialized to 0 and thread1 is called 5 times, thread2 can see x anywhere in the range from 0 to 5.
That means the assignment of an integer to memory can be considered atomic.
There are, however, reasons why x as seen by the two threads may not be synchronized, e.g. while x in thread1 is 5, thread2 may still see 0 at the same time.
One of the reasons is the CPU cache, which is separate for each core. To synchronize the values between caches you have to use memory barriers. You can use for example std::atomic, which does that job for you.
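For instance, a minimal sketch of the question's example rewritten with std::atomic (my rewrite, not code from the question):

#include <atomic>

std::atomic<int> x{0};  // shared between the two threads

void thread1()
{
    x.fetch_add(1);     // atomic read-modify-write: no lost updates
}

void thread2()
{
    if (x.load() == 5)  // atomic load: well-defined even while thread1 runs
    {
        // doSomething
    }
}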
System calls in UNIX-like OSes are reentrant (i.e. multiple system calls may be executed in parallel). Are there any ordering constraints on those system calls in the sense of the C/C++11 happens-before relation?
For example, let's consider the following program with 3 threads (in pseudo-code):
// thread 1
store x 1
store y 2
// thread 2
store y 1
store x 2
// thread 3
Thread.join(1 and 2)
wait((load x) == 1 && (load y) == 1)
Here, suppose x and y are shared locations, and all the loads and stores have relaxed ordering. (NOTE: with relaxed atomic accesses, races are not considered a bug; they are intentional in the sense of C/C++11 semantics.) This program may terminate, since (1) the compiler may reorder store x 1 and store y 2, and then (2) execute store y 2, store y 1, store x 2, and then store x 1, so (3) thread 3 may read x = 1 and y = 1 at the same time.
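In C++11 terms, the first program might be written like this (a sketch of the pseudo-code above; the comment marks the reordering that makes termination possible):

#include <atomic>

std::atomic<int> x{0}, y{0};

void thread1()
{
    x.store(1, std::memory_order_relaxed);
    y.store(2, std::memory_order_relaxed);  // may legally be reordered before the x store
}

void thread2()
{
    y.store(1, std::memory_order_relaxed);
    x.store(2, std::memory_order_relaxed);
}

// Thread 3 joins thread1 and thread2, then checks x == 1 && y == 1,
// which is reachable via the reordered execution
// (store y 2, store y 1, store x 2, store x 1).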
I would like to know if the following program may also terminate. Here, some system calls syscall1() & syscall2() are inserted in threads 1 & 2, respectively:
// thread 1
store x 1
syscall1()
store y 2
// thread 2
store y 1
syscall2()
store x 2
// thread 3
Thread.join(1 and 2)
wait((load x) == 1 && (load y) == 1)
The program seems like it should never terminate. However, in the absence of ordering constraints on the system calls invoked, I think it may terminate. Here is the reason: suppose syscall1() and syscall2() are not serialized and may run in parallel. Then the compiler, even with full knowledge of the semantics of syscall1() and syscall2(), may still reorder store y 2 ahead of store x 1 & syscall1().
So I would like to ask if there are any ordering constraints on system calls invoked by different threads. If possible, I would like to know an authoritative source for this kind of question.
A system call (those listed in syscalls(2)...) is an elementary operation, from the point of view of an application program in user land.
Each system call is (by definition) a call into the kernel, through some single machine-code instruction (SYSENTER, SYSCALL, INT ...); the details depend upon the processor (its instruction set) and the ABI. The kernel does its business (processing your system call, which could succeed or fail), but your user program sees only an elementary step. Sometimes that step (during which control is given to the kernel) can last a long time (e.g. minutes or hours).
So your program in user land runs on a low-level virtual machine, provided by the user-mode machine instructions of your processor augmented by a single "virtual" system-call instruction (capable of performing any system call implemented by the kernel).
This does not forbid your program to be buggy because of race conditions.
I am looking at a C++ class which has the following lines:
while( x > y );
return x - y;
x and y are member variables of type volatile int. I do not understand this construct.
I found the code stub here: https://gist.github.com/r-lyeh/cc50bbed16759a99a226. I guess it is not guaranteed to be correct or even work.
Since x and y have been declared as volatile, the programmer expects that they will be changed from outside the program.
In that case, your code will remain in the loop
while(x>y);
and will return the value x - y once the values are changed from outside such that x <= y. The exact reason why this was written could be guessed if you told us more about your code and where you saw it. The while loop in this case is a technique for waiting on some other event to occur.
It seems
while( x > y );
is a spinning loop. It won't stop until x <= y. As x and y are volatile, they may be changed outside of this routine. So, once x <= y becomes true, x - y will be returned. This technique is used to wait for some event.
Update
According to the gist you added, it seems the idea was to implement a thread-safe, lock-free circular buffer. Yes, the implementation is incorrect. For example, the original code snippet is
unsigned count() const {
    while( tail > head );
    return head - tail;
}
Even if tail becomes less than or equal to head, it is not guaranteed that head - tail returns a positive number. The scheduler may switch execution to another thread immediately after the while loop, and that thread may change the value of head. Anyway, there are a lot of other issues related to how reading and writing shared memory works (memory reordering etc.), so just ignore this code.
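For contrast, here is a sketch of count() with the indices made atomic (my rewrite; it fixes the data race and drops the pointless spin, though the result is of course still only a snapshot):

#include <atomic>

std::atomic<unsigned> head{0}, tail{0};

unsigned count()
{
    // Atomic loads: no torn reads, and the compiler may not elide them.
    // The returned count may still be stale by the time the caller uses it.
    return head.load(std::memory_order_acquire) -
           tail.load(std::memory_order_acquire);
}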
Other replies have already pointed out in detail what the instruction does, but just to recap: since y (or head in the linked example) is declared as volatile, changes made to that variable from a different thread will cause the while loop to finish once the condition has been met.
However, even though the linked code example is very short, it's a near perfect example of how NOT to write code.
First of all the line
while( tail > head );
will waste enormous amounts of CPU cycles, pretty much locking up one core until the condition has been met.
The code gets even better as we go along.
buffer[head++ % N] = item;
Thanks to JAB for pointing out that I confused post- with pre-increment here. The implications below have been corrected.
Since there are no locks or mutexes we obviously will have to assume the worst. The thread will switch after assigning the value in item and before head++ executes. Murphy will then call the function containing this statement again from a second thread, assigning the value of item at the same head position.
After that, head increments. Now we switch back to the first thread and head increments again. So instead of
buffer[original_value_of_head] = item_from_thread1;
buffer[original_value_of_head+1] = item_from_thread2;
we end up with
buffer[original_value_of_head] = item_from_thread2;
buffer[original_value_of_head+1] = whatever_was_there_previously;
You might get away with sloppy coding like this on the client side with few threads, but on the server side this could only be considered a ticking time bomb. Please use synchronisation constructs such as locks or mutexes instead.
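For example, a minimal locked version of the push (my sketch; buffer, head and N are stubbed in to make it self-contained):

#include <mutex>

constexpr unsigned N = 64;
int buffer[N];
unsigned head = 0;
std::mutex buf_mutex;

void push(int item)
{
    // While the lock is held, the store and the increment cannot
    // interleave with another thread's push.
    std::lock_guard<std::mutex> lock(buf_mutex);
    buffer[head % N] = item;
    ++head;
}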
And well, just for the sake of completeness, the line
while( tail > head );
in the method pop_back() should be
while( tail >= head );
unless you want to be able to pop one more element than you actually pushed in (or even pop one element before pushing anything in).
Sorry for writing what basically boils down to a long rant, but if it keeps just one person from copying and pasting that obscene code, it was worthwhile.
Update: Thought I might as well give an example where code like while(x>y); actually makes perfect sense.
Actually you used to see code like that fairly often in the "good old" days. cough DOS.
It wasn't used in the context of threading, though. Mainly as a fallback in case registering an interrupt hook was not possible (you kids might translate that as "not possible to register an event handler").
startAsynchronousDeviceOperation(..);
That might be pretty much anything, e.g. telling the hard disk to read data via DMA, telling the sound card to record via DMA, possibly even invoking functions on a different processor (like the GPU). Typically initiated via something like outb(2).
while(byteswritten==0); // or while (status!=DONE);
If the only communication channel with a device is shared memory, then so be it. I wouldn't expect to see code like that nowadays outside of device drivers and microcontrollers, though. Obviously this assumes the specs state that the memory location is the last one written to.
The volatile keyword is designed to prevent certain optimisations. In this case, without the keyword, the compiler could unroll your while loop into a concrete sequence of instructions which will obviously break in reality since the values could be modified externally.
Imagine the following:
int i = 2;
while (i-- > 0) printf("%d", i);
Most compilers will look at this and simply generate two calls to printf - adding the volatile keyword will instead result in CPU instructions that maintain a counter set to 2 and check its value after every iteration.
For example,
volatile int i = 2;
this_function_runs_on_another_process_and_modifies_the_value(&i);
while(i-- > 0) printf("%d", i);
I'm just reading the C++ Concurrency in Action book by Anthony Williams.
There is this classic example with two threads, one producing data, the other one consuming the data, and A.W. wrote that code pretty clearly:
std::vector<int> data;
std::atomic<bool> data_ready(false);

void reader_thread()
{
    while (!data_ready.load())
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    std::cout << "The answer=" << data[0] << "\n";
}

void writer_thread()
{
    data.push_back(42);
    data_ready = true;
}
And I really don't understand why this code differs from one where I'd use a classic volatile bool instead of the atomic one.
If someone could open my mind on the subject, I'd be grateful.
Thanks.
A "classic" bool, as you put it, would not work reliably (if at all). One reason for this is that the compiler could (and most likely does, at least with optimizations enabled) load data_ready only once from memory, because there is no indication that it ever changes in the context of reader_thread.
You could work around this problem by using volatile bool to enforce loading it every time (which would probably seem to work) but this would still be undefined behavior regarding the C++ standard because the access to the variable is neither synchronized nor atomic.
You could enforce synchronization using the locking facilities from the mutex header, but this would introduce (in your example) unnecessary overhead (hence std::atomic).
The problem with volatile is that it only guarantees that the accesses are not omitted and that their ordering relative to other volatile accesses is preserved. volatile does not guarantee a memory barrier to enforce cache coherence. What this means is that writer_thread on processor A can write the value to its cache (and maybe even to main memory) without reader_thread on processor B seeing it, because the cache of processor B is not yet consistent with the cache of processor A. For a more thorough explanation see memory barrier and cache coherence on Wikipedia.
There can be additional problems with more complex expressions than x = y (e.g. x += y) that require synchronization through a lock (or, in this simple case, an atomic +=) to ensure the value of x does not change during processing.
x += y for example is actually:
read x
compute x + y
write result back to x
If a context switch to another thread occurs during the computation this can result in something like this (2 threads, both doing x += 2; assuming x = 0):
Thread A                      Thread B
------------------------      ------------------------
read x (0)
compute x (0) + 2
<context switch>
                              read x (0)
                              compute x (0) + 2
                              write x (2)
<context switch>
write x (2)
Now x = 2 even though there were two += 2 computations. This effect is known as a lost update.
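With std::atomic the whole read-modify-write becomes one indivisible operation, so the interleaving above cannot happen. A minimal sketch:

#include <atomic>

std::atomic<int> x{0};

void add_two()
{
    x += 2;  // a single atomic read-modify-write; two concurrent calls
             // starting from x == 0 always leave x == 4
}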
The big difference is that this code is correct, while the version with bool instead of atomic<bool> has undefined behavior.
These two lines of code create a race condition (formally, a conflict) because they read from and write to the same variable:
Reader
while (!data_ready)
And writer
data_ready = true;
And a race condition on a normal variable causes undefined behavior, according to the C++11 memory model.
The rules are found in section 1.10 of the Standard, the most relevant being:
Two actions are potentially concurrent if
they are performed by different threads, or
they are unsequenced, and at least one is performed by a signal handler.
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.
You can see that whether the variable is atomic<bool> makes a very big difference to this rule.
Ben Voigt's answer is completely correct, but still a little theoretical, and as a colleague asked me "what does this mean for me?", I decided to try my luck with a slightly more practical answer.
With your sample, the "simplest" optimization problem that could occur is the following:
According to the Standard, an optimized execution order may not change the functionality of a program. The problem is, this is only guaranteed for single-threaded programs, or for each single thread of a multithreaded program considered in isolation.
So, for writer_thread and a (volatile) bool
data.push_back(42);
data_ready = true;
and
data_ready = true;
data.push_back(42);
are equivalent.
The result is that
std::cout << "The answer=" << data[0] << "\n";
can be executed without having pushed any value into data.
An atomic bool does prevent this kind of reordering, as by definition it may not be reordered (with the default memory ordering). There are memory-order arguments for atomic operations which allow statements to be moved in front of the operation but not behind it, and vice versa, but those require really advanced knowledge of your program's structure and the problems this can cause...
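For illustration, those "arguments" are the std::memory_order values. A release/acquire version of this example (a sketch; weaker than the default seq_cst, but sufficient here) would look like this:

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

std::vector<int> data;
std::atomic<bool> data_ready(false);

void writer_thread()
{
    data.push_back(42);
    // The release store may not be moved before the push_back.
    data_ready.store(true, std::memory_order_release);
}

void reader_thread()
{
    // The acquire load may not be moved after the read of data[0],
    // and it synchronizes-with the release store above.
    while (!data_ready.load(std::memory_order_acquire))
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    std::cout << "The answer=" << data[0] << "\n";
}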