std::cout is thread safe, but can cause race conditions? - c++

I'm looking at an online course for C++ multithreading, and I see the following:
If std::cout can have race conditions, then how is it that it's thread-safe? Isn't thread-safety, by definition, free of race conditions?

It's distinguishing between the ordering of calls to operator<< and the output of each character produced by those calls.
Specifically, there is no conflict between threads on a per-character basis but, if you have two threads output pax and diablo respectively, you may end up with any of (amongst others):
paxdiablo
diablopax
pdaixablo
padiablox
diapaxblo
What the quoted text is stating is that there will be no intermixing of the individual characters (for example, the p and the d) in a way that constitutes a data race.
And a race condition isn't necessarily a problem, nor a thread-safety issue, it just means the behaviour can vary based on ordering. It can become such an issue if the arbitrary ordering can corrupt data in some way but that's not the case here (unless you consider badly formatted output to be corrupt).
It's similar to the statement:
result = a * 7 + b * 4;
ISO C++ doesn't actually mandate the order in which a * 7 or b * 4 is evaluated so you could consider that a race condition (it's quite plausible, though unlikely, that the individual calculations could be handed off to separate CPUs for parallelism), but in no way is that going to be a problem.
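To make that concrete, here is a minimal sketch (my own example, not from the original answer) where the unspecified evaluation order becomes observable through side effects:

#include <iostream>

int f() { std::cout << "f "; return 1; }  // stand-in for a * 7, with a side effect
int g() { std::cout << "g "; return 2; }  // stand-in for b * 4, with a side effect

int main() {
    // ISO C++ leaves the evaluation order of the two operands of + unspecified,
    // so this line may print "f g " or "g f ".
    int result = f() * 7 + g() * 4;
    std::cout << result << '\n';  // always 15; only the side-effect order varies
}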
Interestingly, ISO C++ makes no mention of race conditions (which may become a problem), only of data races (which are almost certainly a problem).

There are many levels of thread safety. Some programs must complete in exactly the same order every time; they must behave as if they were executed in a single thread. Others permit some flexibility in the order in which statements are executed. The spec's statements about std::cout put it in the latter category: the order in which characters are printed to the screen is unspecified.
You can see the two behaviors in the act of compiling. In most cases, we don't care about the order in which things get compiled. We happily type make -j8 and have 8 threads (processes) compiling in parallel. Whichever thread does the work is fine by us. On the other hand, consider the compiling requirements of NixOS. In that distro, you compile every application yourself and cross-check it against hashes posted by the authors. When compiling on NixOS, within a compilation unit (one .c or .cpp), you absolutely must operate as if there were a single thread, because you need the exact same product deterministically -- every time.
There is a darker version of this kind of race behavior known as a "data race." This is when two writers try to write the same variable at the same time, or a reader tries to read it while a writer writes to it, all without any synchronization. This is undefined behavior. Indeed, one of my favorite pages on programming goes painstakingly through all the things that can go wrong if there is a data race.
So what is the spec saying about needing to lock cout? What it's saying is that the interthread behavior is specified, and that there is no data race. Thus, you do not need to guard the output stream.
Typically, if you find you want to guard it, it's a good time to take a step back and evaluate your choices. Sure, you can put a mutex on your usage of cout, but you can't put it on any libraries that you are using, so you'll still have issues with interleaving if they print anything. It may be a good time to look at your output system. You may want to start passing messages to a centralized pool (protected by a mutex), and have one thread that writes them out to the screen.
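Here is a minimal sketch of that last idea (all names are mine, not from any particular library): worker threads push complete messages into a mutex-protected queue, and a single dedicated thread is the only one that ever touches std::cout.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

std::queue<std::string> messages;   // shared pool of pending output
std::mutex pool_mutex;              // guards the pool and the done flag
std::condition_variable pool_cv;    // wakes the writer thread
bool done = false;

void log(std::string msg) {         // callable from any thread
    std::lock_guard<std::mutex> lock(pool_mutex);
    messages.push(std::move(msg));
    pool_cv.notify_one();
}

void writer() {                     // the only thread that touches cout
    std::unique_lock<std::mutex> lock(pool_mutex);
    while (!done || !messages.empty()) {
        pool_cv.wait(lock, [] { return done || !messages.empty(); });
        while (!messages.empty()) {
            std::cout << messages.front() << '\n';  // whole messages, never interleaved
            messages.pop();
        }
    }
}

int main() {
    std::thread out(writer);
    std::thread t1([] { log("pax"); });
    std::thread t2([] { log("diablo"); });
    t1.join(); t2.join();
    { std::lock_guard<std::mutex> lock(pool_mutex); done = true; }
    pool_cv.notify_one();
    out.join();
}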

The text is there to warn you that building something entirely out of "thread-safe" components does not guarantee that the thing you built will be "thread-safe."
The std::cout stream is called "thread-safe" because using it in multiple threads without any other synchronization will not break any of its underlying mechanisms. The cout stream will behave in predictable, reasonable ways: everything that various threads write will appear in the output, nothing will appear that wasn't written by some thread, everything written by any one thread will appear in the order that thread wrote it, and so on. It also won't throw unexpected exceptions, segfault your program, or overwrite random variables.
But, if your program allows several threads to "race" to write std::cout, there is nothing that the authors of the library could do to help predict which thread will win. If you expect to see data written by several threads appear in some particular order, then it's your responsibility to ensure that the threads call the library in that same order.
Don't forget that a statement like this is not one std::cout function call; it's three separate calls.
std::cout << "there are " << number_of_doodles << " doodles here.\n";
If some other thread concurrently executes this statement (also three calls):
std::cout << "I like " << name_of_favorite_pet << ".\n";
Then the output of the six separate std::cout calls could be interleaved, e.g.:
I like there are 5 fido doodles here.
.
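One way to avoid that interleaving (my addition, not part of the answer above) is to reduce each logical message to a single call, or, since C++20, to use std::osyncstream, which transfers its buffered characters to the underlying stream atomically:

#include <iostream>
#include <sstream>
#include <syncstream>  // C++20

void report(int number_of_doodles) {
    // Build the whole line first, then hand it to cout in one call.
    // (In practice this usually comes out contiguously, though the standard
    // only promises freedom from interleaving for osyncstream below.)
    std::ostringstream oss;
    oss << "there are " << number_of_doodles << " doodles here.\n";
    std::cout << oss.str();

    // C++20: all characters are transferred atomically when the
    // osyncstream temporary is destroyed at the end of the statement.
    std::osyncstream(std::cout) << "there are " << number_of_doodles
                                << " doodles here.\n";
}

int main() { report(5); }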

Related

Why does not-changing the while's variable condition not result in a loop? [duplicate]

Updated, see below!
I have heard and read that C++0x allows a compiler to print "Hello" for the following snippet
#include <iostream>
int main() {
    while (1)
        ;
    std::cout << "Hello" << std::endl;
}
It apparently has something to do with threads and optimization capabilities. It seems to me that this could surprise many people, though.
Does someone have a good explanation of why this was necessary to allow? For reference, the most recent C++0x draft says at 6.5/5
A loop that, outside of the for-init-statement in the case of a for statement,
makes no calls to library I/O functions, and
does not access or modify volatile objects, and
performs no synchronization operations (1.10) or atomic operations (Clause 29)
may be assumed by the implementation to terminate. [ Note: This is intended to allow compiler transformations, such as removal of empty loops, even when termination cannot be proven. — end note ]
Edit:
This insightful article says of that standard's text:
Unfortunately, the words "undefined behavior" are not used. However, anytime the standard says "the compiler may assume P," it is implied that a program which has the property not-P has undefined semantics.
Is that correct, and is the compiler allowed to print "Bye" for the above program?
There is an even more insightful thread here, about an analogous change to C, started by the author of the above-linked article. Among other useful facts, they present a solution that seems to also apply to C++0x (Update: This won't work anymore with n3225 - see below!)
endless:
goto endless;
A compiler is not allowed to optimize that away, it seems, because it's not a loop, but a jump. Another guy summarizes the proposed change in C++0x and C201X
By writing a loop, the programmer is asserting either that the
loop does something with visible behavior (performs I/O, accesses
volatile objects, or performs synchronization or atomic operations),
or that it eventually terminates. If I violate that assumption
by writing an infinite loop with no side effects, I am lying to the
compiler, and my program's behavior is undefined. (If I'm lucky,
the compiler might warn me about it.) The language doesn't provide
(no longer provides?) a way to express an infinite loop without
visible behavior.
Update on 3.1.2011 with n3225: The committee moved the text to 1.10/24, which says
The implementation may assume that any thread will eventually do one of the following:
terminate,
make a call to a library I/O function,
access or modify a volatile object, or
perform a synchronization operation or an atomic operation.
The goto trick will not work anymore!
To me, the relevant justification is:
This is intended to allow compiler transformations, such as removal of empty loops, even when termination cannot be proven.
Presumably, this is because proving termination mechanically is difficult, and the inability to prove termination hampers compilers which could otherwise make useful transformations, such as moving nondependent operations from before the loop to after or vice versa, performing post-loop operations in one thread while the loop executes in another, and so on. Without these transformations, a loop might block all other threads while they wait for the one thread to finish said loop. (I use "thread" loosely to mean any form of parallel processing, including separate VLIW instruction streams.)
EDIT: Dumb example:
while (complicated_condition()) {
x = complicated_but_externally_invisible_operation(x);
}
complex_io_operation();
cout << "Results:" << endl;
cout << x << endl;
Here, it would be faster for one thread to do the complex_io_operation while the other is doing all the complex calculations in the loop. But without the clause you have quoted, the compiler has to prove two things before it can make the optimisation: 1) that complex_io_operation() doesn't depend on the results of the loop, and 2) that the loop will terminate. Proving 1) is pretty easy, proving 2) is the halting problem. With the clause, it may assume the loop terminates and get a parallelisation win.
I also imagine that the designers considered that the cases where infinite loops occur in production code are very rare and are usually things like event-driven loops which access I/O in some manner. As a result, they have pessimised the rare case (infinite loops) in favour of optimising the more common case (noninfinite, but difficult to mechanically prove noninfinite, loops).
It does, however, mean that infinite loops used in learning examples will suffer as a result, and will raise gotchas in beginner code. I can't say this is entirely a good thing.
EDIT: with respect to the insightful article you now link, I would say that "the compiler may assume X about the program" is logically equivalent to "if the program doesn't satisfy X, the behaviour is undefined". We can show this as follows: suppose there exists a program which does not satisfy property X. Where would the behaviour of this program be defined? The Standard only defines behaviour assuming property X is true. Although the Standard does not explicitly declare the behaviour undefined, it has declared it undefined by omission.
Consider a similar argument: "the compiler may assume a variable x is only assigned to at most once between sequence points" is equivalent to "assigning to x more than once between sequence points is undefined".
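For instance (my example, phrased with the old sequence-point rule the quote uses):

int main() {
    int i = 0;
    i = i++ + ++i;  // i is modified twice without sequencing between the
                    // modifications: the compiler "may assume" this never
                    // happens, i.e. the behaviour is undefined
    return i;
}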
Does someone have a good explanation of why this was necessary to allow?
Yes, Hans Boehm provides a rationale for this in N1528: Why undefined behavior for infinite loops?. Although this is a WG14 document, the rationale applies to C++ as well, and the document refers to both WG14 and WG21:
As N1509 correctly points out, the current draft essentially gives
undefined behavior to infinite loops in 6.8.5p6. A major issue for
doing so is that it allows code to move across a potentially
non-terminating loop. For example, assume we have the following loops,
where count and count2 are global variables (or have had their address
taken), and p is a local variable, whose address has not been taken:
for (p = q; p != 0; p = p -> next) {
++count;
}
for (p = q; p != 0; p = p -> next) {
++count2;
}
Could these two loops be merged and replaced by the following loop?
for (p = q; p != 0; p = p -> next) {
++count;
++count2;
}
Without the special dispensation in 6.8.5p6 for infinite loops, this
would be disallowed: If the first loop doesn't terminate because q
points to a circular list, the original never writes to count2. Thus
it could be run in parallel with another thread that accesses or
updates count2. This is no longer safe with the transformed version
which does access count2 in spite of the infinite loop. Thus the
transformation potentially introduces a data race.
In cases like this, it is very unlikely that a compiler would be able
to prove loop termination; it would have to understand that q points
to an acyclic list, which I believe is beyond the ability of most
mainstream compilers, and often impossible without whole program
information.
The restrictions imposed by non-terminating loops are a restriction on
the optimization of terminating loops for which the compiler cannot
prove termination, as well as on the optimization of actually
non-terminating loops. The former are much more common than the
latter, and often more interesting to optimize.
There are clearly also for-loops with an integer loop variable in
which it would be difficult for a compiler to prove termination, and
it would thus be difficult for the compiler to restructure loops
without 6.8.5p6. Even something like
for (i = 1; i != 15; i += 2)
or
for (i = 1; i <= 10; i += j)
seems nontrivial to handle. (In the former case, some basic number
theory is required to prove termination, in the latter case, we need
to know something about the possible values of j to do so. Wrap-around
for unsigned integers may complicate some of this reasoning further.)
This issue seems to apply to almost all loop restructuring
transformations, including compiler parallelization and
cache-optimization transformations, both of which are likely to gain
in importance, and are already often important for numerical code.
This appears likely to turn into a substantial cost for the benefit of
being able to write infinite loops in the most natural way possible,
especially since most of us rarely write intentionally infinite loops.
The one major difference with C is that C11 provides an exception for controlling expressions that are constant expressions which differs from C++ and makes your specific example well-defined in C11.
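So the same snippet behaves differently under the two standards; a sketch of the contrast:

int main() {
    while (1)  // controlling expression is a constant expression:
        ;      // C11 exempts such loops from the termination assumption
               // (well-defined); C++ has no such exemption, so here the
               // loop may be assumed to terminate (undefined behaviour)
    return 0;  // unreachable under C11 semantics
}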
I think the correct interpretation is the one from your edit: empty infinite loops are undefined behavior.
I wouldn't say it's particularly intuitive behavior, but this interpretation makes more sense than the alternative one, that the compiler is arbitrarily allowed to ignore infinite loops without invoking UB.
If infinite loops are UB, it just means that non-terminating programs aren't considered meaningful: according to C++0x, they have no semantics.
That does make a certain amount of sense too. They are a special case, where a number of side effects just no longer occur (for example, nothing is ever returned from main), and a number of compiler optimizations are hampered by having to preserve infinite loops. For example, moving computations across the loop is perfectly valid if the loop has no side effects, because eventually, the computation will be performed in any case.
But if the loop never terminates, we can't safely rearrange code across it, because we might just be changing which operations actually get executed before the program hangs. Unless we treat a hanging program as UB, that is.
The relevant issue is that the compiler is allowed to reorder code whose side effects do not conflict. The surprising order of execution could occur even if the compiler produced non-terminating machine code for the infinite loop.
I believe this is the right approach. The language spec defines ways to enforce order of execution. If you want an infinite loop that cannot be reordered around, write this:
volatile int dummy_side_effect;
while (1) {
dummy_side_effect = 0;
}
printf("Never prints.\n");
I think this is along the lines of this type of question, which references another thread. Optimization can occasionally remove empty loops.
I think the issue could perhaps best be stated as: "If a later piece of code does not depend on an earlier piece of code, and the earlier piece of code has no side effects on any other part of the system, the compiler's output may execute the later piece of code before, after, or intermixed with, the execution of the former, even if the former contains loops, without regard for when or whether the former code would actually complete." For example, the compiler could rewrite:
void testfermat(int n)
{
int a=1,b=1,c=1;
while(pow(a,n)+pow(b,n) != pow(c,n))
{
if (b > a) a++; else if (c > b) { a=1; b++; } else { a=1; b=1; c++; }
}
printf("The result is ");
printf("%d/%d/%d", a,b,c);
}
as
void testfermat(int n)
{
if (fork_is_first_thread())
{
int a=1,b=1,c=1;
while(pow(a,n)+pow(b,n) != pow(c,n))
{
if (b > a) a++; else if (c > b) { a=1; b++; } else { a=1; b=1; c++; }
}
signal_other_thread_and_die();
}
else // Second thread
{
printf("The result is ");
wait_for_other_thread();
}
printf("%d/%d/%d", a,b,c);
}
Generally not unreasonable, though I might worry that:
int total=0;
for (int i=0; num_reps > i; i++)
{
update_progress_bar(i);
total+=do_something_slow_with_no_side_effects(i);
}
show_result(total);
would become
int total=0;
if (fork_is_first_thread())
{
for (int i=0; num_reps > i; i++)
total+=do_something_slow_with_no_side_effects(i);
signal_other_thread_and_die();
}
else
{
for (int i=0; num_reps > i; i++)
update_progress_bar(i);
wait_for_other_thread();
}
show_result(total);
By having one CPU handle the calculations and another handle the progress bar updates, the rewrite would improve efficiency. Unfortunately, it would make the progress bar updates rather less useful than they should be.
Does someone have a good explanation of why this was necessary to allow?
I have an explanation of why it is necessary not to allow, assuming that C++ still has the ambition to be a general-purpose language suitable for performance-critical, low-level programming.
This used to be a working, strictly conforming freestanding C++ program:
int main()
{
setup_interrupts();
for(;;)
{}
}
The above is a perfectly fine and common way to write main() in many low-end microcontroller systems. The whole program execution is done inside interrupt service routines (or in case of RTOS, it could be processes). And main() may absolutely not be allowed to return since these are bare metal systems and there is nobody to return to.
Typical real-world examples of where the above design can be used are PWM/motor driver microcontrollers, lighting control applications, simple regulator systems, sensor applications, simple household electronics and so on.
Changes to C++ have unfortunately made the language impossible to use for this kind of embedded systems programming. Existing real-world applications like the ones above will break in dangerous ways if ported to newer C++ compilers.
C++20 6.9.2.3 Forward progress [intro.progress]
The implementation may assume that any thread will eventually do one of the following:
(1.1) — terminate,
(1.2) — make a call to a library I/O function,
(1.3) — perform an access through a volatile glvalue, or
(1.4) — perform a synchronization operation or an atomic operation.
Let's address each bullet for the above, previously strictly conforming freestanding C++ program:
1.1. As any embedded systems beginner can tell the committee, bare metal/RTOS microcontroller systems never terminate. Therefore the loop cannot terminate either or main() would turn into runaway code, which would be a severe and dangerous error condition.
1.2 N/A.
1.3 Not necessarily. One may argue that the for(;;) loop is the proper place to feed the watchdog, which in turn is a side effect as it writes to a volatile register. But there may be implementation reasons why you don't want to use a watchdog. At any rate, it is not C++'s business to dictate that a watchdog should be used.
1.4 N/A.
Because of this, C++ is made even more unsuitable for embedded systems applications than it already was before.
Another problem here is that the standard speaks of "threads", which are higher-level concepts. On real-world computers, threads are implemented through a lower-level concept known as interrupts. Interrupts are similar to threads in terms of race conditions and concurrent execution, but they are not threads. On low-level systems there is just one core, and therefore typically only one interrupt executes at a time (much like threads used to work on single-core PCs back in the day).
And you can't have threads if you can't have interrupts. So threads would have to be implemented by a lower-level language more suitable for embedded systems than C++. The options being assembler or C.
By writing a loop, the programmer is asserting either that the loop does something with visible behavior (performs I/O, accesses volatile objects, or performs synchronization or atomic operations), or that it eventually terminates.
This is completely misguided and clearly written by someone who has never worked with microcontroller programming.
So what should those few remaining C++ embedded programmers who refuse to port their code to C "because of reasons" do? You have to introduce a side effect inside the for(;;) loop:
int main()
{
setup_interrupts();
for(volatile int i=0; i==0;)
{}
}
Or if you have the watchdog enabled as you ought to, feed it inside the for(;;) loop.
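A sketch of that watchdog variant (the register address, key value, and setup_interrupts are hypothetical placeholders; consult the MCU's reference manual for the real ones):

#include <cstdint>

void setup_interrupts();  // hypothetical, as in the example above

// Hypothetical memory-mapped watchdog "feed" register.
volatile std::uint32_t* const WDT_FEED =
    reinterpret_cast<volatile std::uint32_t*>(0x40001000u);

int main()
{
    setup_interrupts();
    for (;;)
    {
        *WDT_FEED = 0xA5A5A5A5u;  // volatile access: an observable side effect,
                                  // so the loop satisfies [intro.progress] and
                                  // may not be removed or assumed to terminate
    }
}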
In non-trivial cases, it is not decidable for the compiler whether a loop is infinite at all.
In some cases, the optimiser may even reach a better complexity class for your code (e.g. it was O(n^2) and you get O(n) or O(1) after optimisation).
So including a rule in the C++ standard that disallows removing an infinite loop would make many optimisations impossible. And most people don't want this. I think this quite answers your question.
Another thing: I never have seen any valid example where you need an infinite loop which does nothing.
The one example I have heard about was an ugly hack that really should be solved otherwise: It was about embedded systems where the only way to trigger a reset was to freeze the device so that the watchdog restarts it automatically.
If you know any valid/good example where you need an infinite loop which does nothing, please tell me.
I think it's worth pointing out that loops which would be infinite except for the fact that they interact with other threads via non-volatile, non-synchronised variables can now yield incorrect behaviour with a new compiler.
In other words: make your globals volatile -- as well as arguments passed into such a loop via pointer/reference.
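A sketch of the failure mode and one fix (my example; note that elsewhere on this page it is pointed out that volatile is not a synchronisation mechanism, so std::atomic is the portable C++11 spelling):

#include <atomic>
#include <thread>

std::atomic<bool> done{false};  // shared flag, properly synchronised

void worker() {
    // With a plain bool, the compiler could hoist the load out of the loop,
    // leaving an empty infinite loop -- exactly the UB discussed above.
    while (!done.load(std::memory_order_relaxed)) {
        // spin; the atomic load is an "atomic operation" per [intro.progress]
    }
}

int main() {
    std::thread t(worker);
    done.store(true, std::memory_order_relaxed);
    t.join();
}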

Why does this code run endlessly in a debug build but exit immediately in a release build? [duplicate]


What formally guarantees that non-atomic variables can't see out-of-thin-air values and create a data race like atomic relaxed theoretically can?

This is a question about the formal guarantees of the C++ standard.
The standard points out that the rules for std::memory_order_relaxed atomic variables allow "out of thin air" / "out of the blue" values to appear.
But for non-atomic variables, can this example have UB? Is r1 == r2 == 42 possible in the C++ abstract machine? Neither variable is 42 initially, so you'd expect that neither if body executes, meaning there are no writes to the shared variables.
// Global state
int x = 0, y = 0;
// Thread 1:
r1 = x;
if (r1 == 42) y = r1;
// Thread 2:
r2 = y;
if (r2 == 42) x = 42;
The above example is adapted from the standard, which explicitly says such behavior is allowed by the specification for atomic objects:
[Note: The requirements do allow r1 == r2 == 42 in the following
example, with x and y initially zero:
// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42) y.store(r1, memory_order_relaxed);
// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42) x.store(42, memory_order_relaxed);
However, implementations should not allow such behavior. – end note]
What part of the so-called "memory model" protects non-atomic objects from these interactions caused by reads seeing out-of-thin-air values?
When a race condition would exist with different values for x and y, what guarantees that a read of a shared (normal, non-atomic) variable cannot see such values?
Can not-executed if bodies create self-fulfilling conditions that lead to a data-race?
The text of your question seems to be missing the point of the example and out-of-thin-air values. Your example does not contain data-race UB. (It might if x or y were set to 42 before those threads ran, in which case all bets are off and the other answers citing data-race UB apply.)
There is no protection against real data races, only against out-of-thin-air values.
I think you're really asking how to reconcile that mo_relaxed example with sane and well-defined behaviour for non-atomic variables. That's what this answer covers.
The note is pointing out a hole in the atomic mo_relaxed formalism, not warning you of a real possible effect on some implementations.
This gap does not (I think) apply to non-atomic objects, only to mo_relaxed.
The note says "However, implementations should not allow such behavior. – end note". Apparently the standards committee couldn't find a way to formalize that requirement, so for now it's just a note, but it is not intended to be optional.
It's clear that even though this isn't strictly normative, the C++ standard intends to disallow out-of-thin-air values for relaxed atomics (and in general, I assume). Later standards discussion, e.g. 2018's p0668r5: Revising the C++ memory model (which doesn't "fix" this; it's an unrelated change), includes juicy side-notes like:
We still do not have an acceptable way to make our informal (since C++14) prohibition of out-of-thin-air results precise. The primary practical effect of that is that formal verification of C++ programs using relaxed atomics remains unfeasible. The above paper suggests a solution similar to http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3710.html . We continue to ignore the problem here ...
So yes, the normative parts of the standard are apparently weaker for relaxed_atomic than they are for non-atomic. This seems to be an unfortunate side effect of how they define the rules.
AFAIK no implementations can produce out-of-thin-air values in real life.
Later versions of the standard phrase the informal recommendation more clearly, e.g. in the current draft: https://timsong-cpp.github.io/cppwp/atomics.order#8
Implementations should ensure that no “out-of-thin-air” values are computed that circularly depend on their own computation.
...
[ Note: The recommendation [of 8.] similarly disallows r1 == r2 == 42 in the following example, with x and y again initially zero:
// Thread 1:
r1 = x.load(memory_order::relaxed);
if (r1 == 42) y.store(42, memory_order::relaxed);
// Thread 2:
r2 = y.load(memory_order::relaxed);
if (r2 == 42) x.store(42, memory_order::relaxed);
— end note ]
(This rest of the answer was written before I was sure that the standard intended to disallow this for mo_relaxed, too.)
I'm pretty sure the C++ abstract machine does not allow r1 == r2 == 42.
Every possible ordering of operations in the C++ abstract machine operations leads to r1=r2=0 without UB, even without synchronization. Therefore the program has no UB and any non-zero result would violate the "as-if" rule.
Formally, ISO C++ allows an implementation to implement functions / programs in any way that gives the same result as the C++ abstract machine would. For multi-threaded code, an implementation can pick one possible abstract-machine ordering and decide that's the ordering that always happens. (e.g. when reordering relaxed atomic stores when compiling to asm for a strongly-ordered ISA. The standard as written even allows coalescing atomic stores but compilers choose not to). But the result of the program always has to be something the abstract machine could have produced. (Only the Atomics chapter introduces the possibility of one thread observing the actions of another thread without mutexes. Otherwise that's not possible without data-race UB).
I think the other answers didn't look carefully enough at this. (And neither did I when it was first posted). Code that doesn't execute doesn't cause UB (including data-race UB), and compilers aren't allowed to invent writes to objects. (Except in code paths that already unconditionally write them, like y = (x==42) ? 42 : y; which would obviously create data-race UB.)
For any non-atomic object, if we don't actually write it then other threads might also be reading it, regardless of code inside not-executed if blocks. The standard allows this and doesn't allow a variable to suddenly read as a different value when the abstract machine hasn't written it. (And for objects we don't even read, like neighbouring array elements, another thread might even be writing them.)
Therefore we can't do anything that would let another thread temporarily see a different value for the object, or step on its write. Inventing writes to non-atomic objects is basically always a compiler bug; this is well known and universally agreed upon because it can break code that doesn't contain UB (and has done so in practice for a few cases of compiler bugs that created it, e.g. IA-64 GCC I think had such a bug at one point that broke the Linux kernel). IIRC, Herb Sutter mentioned such bugs in part 1 or 2 of his talk "atomic<> Weapons: The C++ Memory Model and Modern Hardware", saying that it was already usually considered a compiler bug before C++11, but C++11 codified that and made it easier to be sure.
Or another recent example with ICC for x86:
Crash with icc: can the compiler invent writes where none existed in the abstract machine?
In the C++ abstract machine, there's no way for execution to reach either y = r1; or x = r2;, regardless of sequencing or simultaneity of the loads for the branch conditions. x and y both read as 0 and neither thread ever writes them.
No synchronization is required to avoid UB because no order of abstract-machine operations leads to a data-race. The ISO C++ standard doesn't have anything to say about speculative execution or what happens when mis-speculation reaches code. That's because speculation is a feature of real implementations, not of the abstract machine. It's up to implementations (HW vendors and compiler writers) to ensure the "as-if" rule is respected.
It's legal in C++ to write code like if (global_id == mine) shared_var = 123; and have all threads execute it, as long as at most one thread actually runs the shared_var = 123; statement. (And as long as synchronization exists to avoid a data race on non-atomic int global_id). If things like this broke down, it would be chaos. For example, you could apparently draw wrong conclusions like reordering atomic operations in C++
Observing that a non-write didn't happen isn't data-race UB.
It's also not UB to run if(i<SIZE) return arr[i]; because the array access only happens if i is in bounds.
I think the "out of the blue" value-invention note only applies to relaxed-atomics, apparently as a special caveat for them in the Atomics chapter. (And even then, AFAIK it can't actually happen on any real C++ implementations, certainly not mainstream ones. At this point implementations don't have to take any special measures to make sure it can't happen for non-atomic variables.)
I'm not aware of any similar language outside the atomics chapter of the standard that allows an implementation to allow values to appear out of the blue like this.
I don't see any sane way to argue that the C++ abstract machine causes UB at any point when executing this, yet seeing r1 == r2 == 42 would imply that an unsynchronized read+write had happened -- and that's data-race UB. If that can happen, can an implementation invent UB because of speculative execution (or some other reason)? The answer has to be "no" for the C++ standard to be usable at all.
For relaxed atomics, inventing the 42 out of nowhere wouldn't imply that UB had happened; perhaps that's why the standard says it's allowed by the rules? As far as I know, nothing outside the Atomics chapter of the standard allows it.
A hypothetical asm / hardware mechanism that could cause this
(Nobody wants this, hopefully everyone agrees that it would be a bad idea to build hardware like this. It seems unlikely that coupling speculation across logical cores would ever be worth the downside of having to roll back all cores when one detects a mispredict or other mis-speculation.)
For 42 to be possible, thread 1 has to see thread 2's speculative store and the store from thread 1 has to be seen by thread 2's load. (Confirming that branch speculation as good, allowing this path of execution to become the real path that was actually taken.)
i.e. speculation across threads: Possible on current HW if they ran on the same core with only a lightweight context switch, e.g. coroutines or green threads.
But on current HW, memory reordering between threads is impossible in that case. Out-of-order execution of code on the same core gives the illusion of everything happening in program order. To get memory reordering between threads, they need to be running on different cores.
So we'd need a design that coupled together speculation between two logical cores. Nobody does that because it means more state needs to rollback if a mispredict is detected. But it is hypothetically possible. For example an OoO SMT core that allows store-forwarding between its logical cores even before they've retired from the out-of-order core (i.e. become non-speculative).
PowerPC allows store-forwarding between logical cores for retired stores, meaning that threads can disagree about the global order of stores. But waiting until they "graduate" (i.e. retire) and become non-speculative means it doesn't tie together speculation on separate logical cores. So when one is recovering from a branch miss, the others can keep the back-end busy. If they all had to rollback on a mispredict on any logical core, that would defeat a significant part of the benefit of SMT.
I thought for a while that I'd found an ordering that leads to this on a single core of a real weakly-ordered CPU (with user-space context switching between the threads), but the final-step store can't forward to the first-step load, because that's program order and OoO exec preserves it.
T2: r2 = y; stalls (e.g. cache miss)
T2: branch prediction predicts that r2 == 42 will be true, so x = 42 should run.
T2: x = 42 runs. (Still speculative; r2 = y hasn't obtained a value yet, so the r2 == 42 compare/branch is still waiting to confirm that speculation.)
a context switch to Thread 1 happens without rolling back the CPU to retirement state or otherwise waiting for speculation to be confirmed as good or detected as mis-speculation.
This part won't happen on real C++ implementations unless they use an M:N thread model, not the more common 1:1 C++ thread to OS thread. Real CPUs don't rename the privilege level: they don't take interrupts or otherwise enter the kernel with speculative instructions in flight that might need to rollback and redo entering kernel mode from a different architectural state.
T1: r1 = x; takes its value from the speculative x = 42 store
T1: r1 == 42 is found to be true. (Branch speculation happens here, too, not actually waiting for store-forwarding to complete. But along this path of execution, where the x = 42 did happen, this branch condition will execute and confirm the prediction).
T1: y = 42 runs.
this was all on the same CPU core so this y=42 store is after the r2=y load in program-order; it can't give that load a 42 to let the r2==42 speculation be confirmed. So this possible ordering doesn't demonstrate this in action after all. This is why threads have to be running on separate cores with inter-thread speculation for effects like this to be possible.
Note that x = 42 doesn't have a data dependency on r2 so value-prediction isn't required to make this happen. And the y=r1 is inside an if(r1 == 42) anyway so the compiler can optimize to y=42 if it wants, breaking the data dependency in the other thread and making things symmetric.
Note that the arguments about Green Threads or other context switch on a single core isn't actually relevant: we need separate cores for the memory reordering.
I commented earlier that I thought this might involve value-prediction. The ISO C++ standard's memory model is certainly weak enough to allow the kinds of crazy "reordering" that value-prediction can create, but it's not necessary for this reordering. y=r1 can be optimized to y=42, and the original code includes x=42 anyway, so there's no data dependency of that store on the r2=y load. Speculative stores of 42 are easily possible without value prediction. (The problem is getting the other thread to see them!)
Speculating because of branch prediction instead of value prediction has the same effect here. And in both cases the loads need to eventually see 42 to confirm the speculation as correct.
Value-prediction doesn't even help make this reordering more plausible. We still need inter-thread speculation and memory reordering for the two speculative stores to confirm each other and bootstrap themselves into existence.
ISO C++ chooses to allow this for relaxed atomics, but AFAICT it disallows it for non-atomic variables. I'm not sure I see exactly what in the standard allows the relaxed-atomic case in ISO C++, beyond the note saying it's not explicitly disallowed. If there was any other code that did anything with x or y then maybe, but I think my argument applies to the relaxed-atomic case as well. No path through the source in the C++ abstract machine can produce it.
As I said, it's not possible in practice AFAIK on any real hardware (in asm), or in C++ on any real C++ implementation. It's more of an interesting thought-experiment into crazy consequences of very weak ordering rules, like C++'s relaxed-atomic. (Those ordering rules don't disallow it, but I think the as-if rule and the rest of the standard does, unless there's some provision that allows relaxed atomics to read a value that was never actually written by any thread.)
If there is such a rule, it would only be for relaxed atomics, not for non-atomic variables. Data-race UB is pretty much all the standard needs to say about non-atomic vars and memory ordering, but we don't have that.
When a race condition potentially exists, what guarantees that a read of a shared variable (normal, non-atomic) cannot see a write?
There is no such guarantee.
When a race condition exists, the behaviour of the program is undefined:
[intro.races]
Two actions are potentially concurrent if
they are performed by different threads, or
they are unsequenced, at least one is performed by a signal handler, and they are not both performed by the same signal handler invocation.
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior. ...
The special case is not very relevant to the question, but I'll include it for completeness:
Two accesses to the same object of type volatile std::sig_­atomic_­t do not result in a data race if both occur in the same thread, even if one or more occurs in a signal handler. ...
What part of the so called "memory model" protects non atomic objects from these interactions caused by reads that see the interaction?
None. In fact, you get the opposite, and the standard explicitly calls this out as undefined behavior. In [intro.races]/21 we have
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.
which covers your second example.
The rule is that if you have shared data accessed by multiple threads, and at least one of those threads writes to that shared data, then you need synchronization. Without it you have a data race and undefined behavior. Do note that volatile is not a valid synchronization mechanism. You need atomics/mutexes/condition variables to protect shared access.
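As a minimal sketch of that rule (my example, not from the standard):

#include <iostream>
#include <mutex>
#include <thread>

int shared_value = 0;  // plain, non-atomic shared data
std::mutex m;          // every access below goes through this mutex

void writer() {
    std::lock_guard<std::mutex> lock(m);
    shared_value = 42;
}

void reader() {
    std::lock_guard<std::mutex> lock(m);
    std::cout << shared_value << '\n';  // prints 0 or 42; no data race either way
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}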
Note: The specific examples I give here are apparently not accurate; I've assumed the optimizer can be somewhat more aggressive than it's actually allowed to be. There is some excellent discussion about this in the comments. I'm going to have to investigate further, but wanted to leave this note here as a warning.
Other people have given you answers quoting the appropriate parts of the standard that flat out state that the guarantee you think exists, doesn't. It appears that you're interpreting a part of the standard that says a certain weird behavior is permitted for atomic objects if you use memory_order_relaxed as meaning that this behavior is not permitted for non-atomic objects. This is a leap of inference that is explicitly addressed by other parts of the standard that declare the behavior undefined for non-atomic objects.
In practical terms, here is an order of events that might happen in thread 1 that would be perfectly reasonable, but result in the behavior you think is barred even if the hardware guaranteed that all memory access was completely serialized between CPUs. Keep in mind that the standard has to not only take into account the behavior of the hardware, but the behavior of optimizers, which often aggressively re-order and re-write code.
Thread 1 could be re-written by an optimizer to look this way:
old_y = y; // old_y is a hidden variable (perhaps a register) created by the optimizer
y = 42;
if (x != 42) y = old_y;
There might be perfectly reasonable reasons for an optimizer to do this. For example, it may decide that it's far more likely than not for 42 to be written into y, and for dependency reasons, the pipeline might work a lot better if the store into y occurs sooner rather than later.
The rule is that the apparent result must look as if the code you wrote is what was executed. But there is no requirement that the code you write bears any resemblance at all to what the CPU is actually told to do.
The atomic variables impose constraints on the compiler's ability to rewrite code, as well as instructing the compiler to issue special CPU instructions that constrain the CPU's ability to reorder memory accesses. The constraints that even memory_order_relaxed imposes are much stronger than what applies to ordinary variables: the compiler would generally be allowed to get rid of any reference to x and y entirely if they weren't atomic.
Additionally, if they are atomic, the compiler must ensure that other CPUs see the entire variable as either the new value or the old value. For example, if the variable is a 32-bit entity that crosses a cache line boundary and a modification involves changing bits on both sides of that boundary, one CPU may see a value of the variable that was never written, because it only sees the update to the bits on one side of the boundary. That is not allowed for atomic variables, even those modified with memory_order_relaxed.
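To make that concrete, here is a minimal sketch (assuming C++11; the names are illustrative): with std::atomic, even a relaxed store is all-or-nothing, so no reader can ever observe a half-written value.

#include <atomic>

std::atomic<unsigned> x{0};

void writer() {
    // The whole value becomes visible at once, even with relaxed ordering.
    x.store(0xBAADF00Du, std::memory_order_relaxed);
}

unsigned reader() {
    // Guaranteed to return either 0 or 0xBAADF00D, never a mix of the two.
    return x.load(std::memory_order_relaxed);
}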
That is why data races are labeled as undefined behavior by the standard. The space of the possible things that could happen is probably a lot wilder than your imagination could account for, and certainly wider than any standard could reasonably encompass.
(Stack Overflow complained about the number of comments I posted above, so I've gathered them into an answer with some modifications.)
The excerpt you cite from C++ standard working draft N3337 was wrong.
[Note: The requirements do allow r1 == r2 == 42 in the following
example, with x and y initially zero:
// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42)
    y.store(r1, memory_order_relaxed);

// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42)
    x.store(42, memory_order_relaxed);
A programming language should never allow this "r1 == r2 == 42" to happen.
This has nothing to do with the memory model. It is required by causality, which is the basic methodology of logic and a foundation of any programming language design. It is the fundamental contract between human and computer. Any memory model should abide by it; otherwise it is a bug.
The causality here is reflected by the intra-thread dependences between operations within a thread, such as data dependence (e.g., a read after a write to the same location) and control dependence (e.g., an operation inside a branch). They cannot be violated by any language specification. Any compiler/processor design should respect these dependences in its committed result (i.e., the externally visible, program-visible result).
The memory model is mainly about the ordering of memory operations among multiple processors, and it should never violate intra-thread dependence, although a weak model may allow the causal order seen on one processor to appear violated (or go unseen) on another processor.
In your code snippet, both threads have (intra-thread) data dependence (load->check) and control dependence (check->store) that ensure their respective executions (within a thread) are ordered. That means, we can check the later op's output to determine if the earlier op has executed.
Then we can use simple logic to deduce that, if both r1 and r2 are 42, there must be a dependence cycle, which is impossible, unless you remove one condition check, which essentially breaks the dependence cycle. This has nothing to do with memory model, but intra-thread data dependence.
Causality (or, more accurately here, intra-thread dependence) is defined in the C++ standard, though not so explicitly in early drafts, because dependence is more of a micro-architecture and compiler term. In a language spec it is usually defined as operational semantics. For example, the control dependence formed by an if statement is defined in the same version of the draft you cited as "If the condition yields true the first substatement is executed." That defines the sequential execution order.
That said, the compiler and processor can schedule one or more operations of the if-branch to be executed before the if-condition is resolved. But no matter how the compiler and processor schedule the operations, the result of the if-branch cannot be committed (i.e., become visible to the program) before the if-condition is resolved. One should distinguish between semantics requirement and implementation details. One is language spec, the other is how the compiler and processor implement the language spec.
Actually the current C++ standard draft has corrected this bug in https://timsong-cpp.github.io/cppwp/atomics.order#9 with a slight change.
[ Note: The recommendation similarly disallows r1 == r2 == 42 in the following example, with x and y again initially zero:
// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42)
    y.store(42, memory_order_relaxed);

// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42)
    x.store(42, memory_order_relaxed);

Are atomic types necessary in multi-threading? (OS X, clang, c++11)

I'm trying to demonstrate that it's a very bad idea not to use std::atomic<>s, but I can't manage to create an example that reproduces the failure. I have two threads and one of them does:
{
    foobar = false;
}
and the other:
{
    if (foobar) {
        // ...
    }
}
the type of foobar is either bool or std::atomic_bool, and it's initialized to true. I'm using OS X Yosemite and even tried to use this trick to hint via CPU affinity that I want the threads to run on different cores. I run such operations in loops etc., and in any case there's no observable difference in execution. I end up inspecting the generated assembly with clang (clang -std=c++11 -lstdc++ -O3 -S test.cpp) and I see that the asm differences on the read side are minor (listings not reproduced here). No mfence or anything that "dramatic". On the write side something more "dramatic" happens: the atomic<> version uses xchgb, which carries an implicit lock. When I compile with a relatively old version of gcc (v4.5.2), I can see all sorts of mfences being added, which also indicates there's a serious concern.
I kind of understand that "x86 implements a very strong memory model" (ref) and that mfences might not be necessary, but does that mean that unless I want to write cross-platform code that e.g. supports ARM, I don't really need to use any atomic<>s, unless I care about consistency at the nanosecond level?
I've watched "atomic<> Weapons" from Herb Sutter but I'm still impressed with how difficult it is to create a simple example that reproduces those problems.
The big problem with data races is that they're undefined behavior, not guaranteed wrong behavior. And this, in conjunction with the general unpredictability of threads and the strength of the x64 memory model, means that it gets really hard to create reproducible failures.
A slightly more reliable failure mode is when the optimizer does unexpected things, because you can observe those in the assembly. Of course, the optimizer is notoriously finicky as well and might do something completely different if you change just one code line.
Here's an example failure that we had in our code at one point. The code implemented a sort of spin lock, but didn't use atomics.
bool operation_done;

void thread1() {
    while (!operation_done) {
        sleep();
    }
    // do something that depends on operation being done
}

void thread2() {
    // do the operation
    operation_done = true;
}
This worked fine in debug mode, but the release build got stuck. Debugging showed that execution of thread1 never left the loop, and looking at the assembly, we found that the condition was gone; the loop was simply infinite.
The problem was that the optimizer realized that under its memory model, operation_done could not possibly change within the loop (that would have been a data race), and thus it "knew" that once the condition was true once, it would be true forever.
Changing the type of operation_done to atomic_bool (or actually, a pre-C++11 compiler-specific equivalent) fixed the issue.
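For illustration, here is a minimal sketch of the fixed version (using std::atomic<bool> and standard sleep facilities in place of the original sleep() placeholder):

#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> operation_done{false};

void thread1() {
    while (!operation_done.load()) {   // re-read on every iteration; loop can exit
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    // do something that depends on the operation being done
}

void thread2() {
    // do the operation
    operation_done.store(true);        // guaranteed to become visible to thread1
}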
This is my own version of @Sebastian Redl's answer that fits the question more closely. I will still accept his for credit, plus kudos to @HansPassant, whose comment brought my attention back to the writes and made everything clear: as soon as I observed that the compiler was adding synchronization on writes, the problem turned out to be that it wasn't optimizing bool accesses as much as one would expect.
I was able to have a trivial program that reproduces the problem:
#include <atomic>
#include <iostream>
#include <unistd.h>

std::atomic_bool foobar(true);
//bool foobar = true;
long long cnt = 0;
long long loops = 400000000ll;

void thread_1() {
    usleep(200000);          // give thread_2 a head start
    foobar = false;
}

void thread_2() {
    while (loops--) {
        if (foobar) {
            ++cnt;
        }
    }
    std::cout << cnt << std::endl;
}
The main difference from my original code was that I used to have a usleep() inside the while loop; that alone was enough to prevent any optimization of the loop. The cleaner code above yields the same asm for the write, but quite different asm for the read (listings not reproduced here): in the bool case, clang hoisted the if (foobar) check out of the loop entirely. Thus when I run the bool case I get:
400000000
real 0m1.044s
user 0m1.032s
sys 0m0.005s
while when I run the atomic_bool case I get:
95393578
real 0m0.420s
user 0m0.414s
sys 0m0.003s
It's interesting that the atomic_bool case is faster - I guess because it does just 95 million increments of the counter, as opposed to 400 million in the bool case.
What is even more crazy-interesting though is this: if I move the std::cout << cnt << std::endl; out of the threaded code, to after pthread_join(), the loop in the non-atomic case disappears entirely. The generated code amounts to just if (foobar != 0) cnt = loops; (again, listing not reproduced here). Clever clang. Then the execution yields:
400000000
real 0m0.206s
user 0m0.001s
sys 0m0.002s
while the atomic_bool remains the same.
So that's more than enough evidence that we should use atomics. The only thing to remember: don't put any usleep() in your benchmarks, because even a small one will prevent quite a few compiler optimizations.
In general, it is very rare that the use of atomic types by itself actually does anything useful for you in multithreaded situations. It is usually more useful to reach for things like mutexes, semaphores and so on.
One reason why it's not very useful: As soon as you have two values that both need to be changed in an atomic way, you are absolutely stuck. You can't do it with atomic values. And it's quite rare that I want to change a single value in an atomic way.
In iOS and OS X, the three methods to use are:
protecting the change using @synchronized;
avoiding multi-threaded access by running the code on a serial queue (which may be the main queue);
using mutexes.
I hope you are aware that atomicity for boolean values is rather pointless. What you have is a race condition: One thread stores a value, another reads it. Atomicity doesn't make a difference here. It makes (or might make) a difference if two threads accessing a variable at exactly the same time causes problems. For example, if a variable is incremented on two threads at exactly the same time, is it guaranteed that the final result is increased by two? That requires atomicity (or one of the methods mentioned earlier).
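For the increment example, a minimal sketch of what does require atomicity (the names are hypothetical): with a plain int, two concurrent increments can lose an update, while std::atomic makes the read-modify-write indivisible.

#include <atomic>

std::atomic<int> counter{0};

void increment() {
    counter.fetch_add(1);   // one indivisible read-modify-write;
                            // two concurrent calls always add exactly 2
}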
Sebastian makes the ridiculous claim that atomicity fixes the data race: That's nonsense. In a data race, a reader will read a value either before or after it is changed, whether that value is atomic or not doesn't make any difference whatsoever. The reader will read the old value or the new value, so the behaviour is unpredictable. All that atomicity does is prevent the situation that the reader would read some in-between state. Which doesn't fix the data race.

do integer reads need to be critical section protected?

I have come across some C++03 code that takes this form:
struct Foo {
    int a;
    int b;
    CRITICAL_SECTION cs;
};
// DoFoo::Foo foo_;
void DoFoo::Foolish()
{
    if( foo_.a == 4 )
    {
        PerformSomeTask();
        EnterCriticalSection(&foo_.cs);
        foo_.b = 7;
        LeaveCriticalSection(&foo_.cs);
    }
}
Does the read from foo_.a need to be protected? e.g.:
void DoFoo::Foolish()
{
    EnterCriticalSection(&foo_.cs);
    int a = foo_.a;
    LeaveCriticalSection(&foo_.cs);
    if( a == 4 )
    {
        PerformSomeTask();
        EnterCriticalSection(&foo_.cs);
        foo_.b = 7;
        LeaveCriticalSection(&foo_.cs);
    }
}
If so, why?
Please assume the integers are 32-bit aligned. The platform is ARM.
Technically yes, but in practice no on many platforms. First, let us assume that int is 32 bits (which is pretty common, but not nearly universal).
It is possible that the two halves (16-bit parts) of a 32-bit int will be read or written separately. On some systems, they will be accessed separately if the int isn't aligned properly.
Imagine a system where you can only do 32-bit-aligned 32-bit reads and writes (and 16-bit-aligned 16-bit reads and writes), and an int that straddles such a boundary. Initially the int is zero (i.e., 0x00000000).
One thread writes 0xBAADF00D to the int, the other reads it "at the same time".
The writing thread first writes 0xBAAD to the high word of the int. The reader thread then reads the entire int (both high and low) getting 0xBAAD0000 -- which is a state that the int was never put into on purpose!
The writer thread then writes the low word 0xF00D.
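Here is a hypothetical sketch of that sequence, modelling the misaligned int as two separately-written 16-bit halves. It illustrates the hardware-level interleaving described above; as an actual C++ program it would itself be a data race.

#include <cstdint>

std::uint16_t halves[2] = {0, 0};   // stands in for one misaligned 32-bit int

void writer() {
    halves[1] = 0xBAAD;             // high half written first...
    halves[0] = 0xF00D;             // ...low half written second
}

std::uint32_t reader() {
    // Run between the two writes, this returns 0xBAAD0000 -- a value
    // the writer never intended to store.
    return (std::uint32_t(halves[1]) << 16) | halves[0];
}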
As noted, on some platforms all 32 bit reads/writes are atomic, so this isn't a concern. There are other concerns, however.
Most lock/unlock code includes instructions to the compiler to prevent reordering across the lock. Without that prevention of reordering, the compiler is free to reorder things so long as it behaves "as-if" in a single threaded context it would have worked that way. So if you read a then b in code, the compiler could read b before it reads a, so long as it doesn't see an in-thread opportunity for b to be modified in that interval.
So possibly the code you are reading is using these locks to make sure that the read of the variable happens in the order written in the code.
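As a sketch of that use (the names are illustrative, and the critical section is assumed to have been initialized elsewhere): the lock calls keep the compiler from hoisting or reordering the loads across them.

#include <windows.h>

int a = 0, b = 0;        // shared, written by other threads
CRITICAL_SECTION cs;     // assumed initialized via InitializeCriticalSection

void read_pair(int& ra, int& rb) {
    // Without the lock, the compiler could emit the load of b before the
    // load of a; the lock/unlock calls forbid reordering across them.
    EnterCriticalSection(&cs);
    ra = a;
    rb = b;
    LeaveCriticalSection(&cs);
}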
Other issues are raised in the comments below, but I don't feel competent to address them: cache issues, and visibility.
Looking at this, it seems that ARM has quite a relaxed memory model, so you need some form of memory barrier to ensure that writes in one thread are visible when you'd expect them in another thread. So what you are doing, or else using std::atomic, seems likely necessary on your platform. Unless you take this into account, you can see updates out of order in different threads, which would break your example.
I think you can use C++11 to ensure that integer reads are atomic, using (for example) std::atomic<int>.
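A minimal sketch of that suggestion (C++11; the names are illustrative): the atomic read needs no critical section at all.

#include <atomic>

std::atomic<int> a{0};

bool is_four() {
    return a.load() == 4;   // atomic; sequentially consistent by default
}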
The C++ standard says that there's a data race if one thread writes to a variable at the same time as another thread reads from that variable, or if two threads write to the same variable at the same time. It further says that a data race produces undefined behavior. So, formally, you must synchronize those reads and writes.
There are three separate issues when one thread reads data that was written by another thread. First, there is tearing: if writing requires more than a single bus cycle, it's possible for a thread switch to occur in the middle of the operation, and another thread could see a half-written value; there's an analogous problem if a read requires more than a single bus cycle. Second, there's visibility: each processor has its own local copy of the data that it's been working on recently, and writing to one processor's cache does not necessarily update another processor's cache. Third, there are compiler optimizations that reorder reads and writes in ways that would be fine within a single thread but will break multi-threaded code. Thread-safe code has to deal with all three problems. That's the job of synchronization primitives: mutexes, condition variables, and atomics.
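To tie the three issues together, here is a minimal C++11 sketch (the names are illustrative) of the common flag-plus-data pattern: the release/acquire pair rules out tearing, stale visibility, and harmful reordering at once.

#include <atomic>

int payload = 0;                    // ordinary data
std::atomic<bool> ready{false};     // synchronization flag

void producer() {
    payload = 123;                                   // plain write...
    ready.store(true, std::memory_order_release);    // ...published by the release store
}

void consumer() {
    if (ready.load(std::memory_order_acquire)) {     // acquire pairs with the release
        // payload is guaranteed to be 123 here: the payload write
        // happens-before this read.
        int v = payload;
        (void)v;
    }
}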
Although the integer read/write operation will indeed most likely be atomic, compiler optimizations and the processor cache will still give you problems if you don't handle them properly.
To explain - the compiler will normally assume that the code is single-threaded and make many optimizations that rely on that. For example, it might change the order of instructions. Or, if it sees that the variable is only written and never read, it might optimize it away entirely.
The CPU will also cache that integer, so if one thread writes it, the other one might not get to see it until a lot later.
There are two things you can do. One is to wrap the access in a critical section, as in your original code. The other is to mark the variable as volatile, which signals the compiler that the variable will be accessed by multiple threads, disables a range of optimizations, and (or so I understood) places special cache-sync instructions (aka "memory barriers") around accesses to it. Apparently this is wrong: volatile is not a valid synchronization mechanism in C++.
Added: Also, as noted by another answer, Windows has Interlocked APIs that can be used to avoid these issues for non-volatile variables.
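For completeness, a small sketch of that approach (Windows API; the usage shown here is illustrative):

#include <windows.h>

volatile LONG counter = 0;

void bump() {
    InterlockedIncrement(&counter);   // atomic read-modify-write with a full barrier
}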