Can I use char variable without lock in the multi-threading case - c++

As c/c++ standard said, the size of char must be 1. As my understanding, that means CPU guarantees that any read or write on a char must be done in one instruction.
Let's say we have many threads, which share a char variable:
char target = 1;
// thread a
target = 0;
// thread b
target = 1;
// thread 1
while (target == 1) {
// do something
}
// thread 2
while (target == 1) {
// do something
}
In a word, there are two kinds of threads: some of them are to set target into 0 or 1, and the others are to do some tasks if target == 1. The goal is that we can control the task-threds through modifying the value of target.
As my understanding, it doesn't seem that we need to use mutex/lock at all. But my coding experience gave me a strong feeling that we must use mutex/lock in this case.
I'm confused now. Should I use mutex/lock or not in this case?
You see, I can understand why we need mutex/lock in other cases, such as i++. Because i++ can't be done in only one instruction. So can target = 0 be done in one instruction, right? If so, does it mean that we don't need mutex/lock in this case?
Well, I know that we could use std::atomic, so my question is: is it OK to not use neither mutex/lcok nor std::atomic.

std::atomic guarantees that accessing a variable is atomic. From cppreference:
Each instantiation and full specialization of the std::atomic template
defines an atomic type. If one thread writes to an atomic object while
another thread reads from it, the behavior is well-defined (see memory
model for details on data races).
When a char actually is atomic (being size 1, is not sufficient), then std::atomic<char> needs no extra synchronization. However, on a platform where char is not atomic, std::atomic<char> guarantees that it can be read and written atomically by using a mutex or similar.
In practice, I'd expect char to be atomic, but the standard does not guarantee that.
Also consider that operations like eg += read and write the value, hence atomic reads and writes alone are not sufficient to safely call +=, while std::atomic<T> has a proper operator+=.
TL;DR
I'm confused now. Should I use mutex/lock or not in this case?
Let someone else take that decision for you. When you want something atomic, use a std::atomic<something> unless you want fine grained control over the synchronisation.

TL;DR
is it OK to not use neither mutex/lcok nor std::atomic
No
In general, it's not ok to assume things. If you need guarantees for something, then make sure you have them.
This is closely related to a common logical fallacy. Just because you cannot imagine why something could be true, that does not mean that it's true.
Longer version
As c/c++ standard said
There's no such thing as "C/C++" and definitely not "the C/C++ standard". They are two completely different languages with different standards. However, they do agree on this point. sizeof (char) is 1 in both languages.
(Sidenote: sizeof 'a' will yield different results.)
As my understanding, that means CPU guarantees that any read or write on a char must be done in one instruction.
That's not correct. The CPU has it's own specification, completely separate from the language standards. And there's nothing that says that this has to be true, even if it probably are in most or all cases.
Because i++ can't be done in only one instruction.
That is CPU dependent. The x86 architecture has an instruction for this. https://c9x.me/x86/html/file_module_x86_id_140.html
As my understanding, it doesn't seem that we need to use mutex/lock at all. But my coding experience gave me a strong feeling that we must use mutex/lock in this case.
Even if the target CPU does read and write in one instruction, which it probably does, there's nothing that says that the C or C++ code needs to be compiled to just that instruction.
The standards for both C and C++ describes the behavior of the code. Not how it is converted to assembly.
So no, you cannot make the assumptions you're doing.

In general, it cannot be assumed that reading or writing a char is an atomic operation. However, the target architecture may provide that guarantee. For embedded C programs it is common practice to rely on such underlying guarantees to avoid the overhead of synchronization mechanisms in certain situations.
In the example in the question it must be noted that even if reading/writing target is an atomic operation, the value could be changed at any time, so there is no guarantee that it will be 1 inside the while loops.

Related

Is atomic.load() is faster than calling it normal

Recently I have been using atomic numbers alot in c++ as i use threading too much and thread safe is important to me
Well, I had a problem with printf() function here is an example
atomic_uint64_t count = {0}
printf("%lu",count);
// It gives error couple of errors like atomic(cost atomic&) = delete; and use of deleted function atomic so i had to write it like this to make it work
printf("%lu",count.load());
// Or
printf("%lu",(uint64_t)count);
Well anyways i don't which is better for performance i really care about the speed
So i started to thinking about which is better to retrieve the value and use it in if conditions or anywhere else
Like
if(count.load() < 8 ){
// Do smth
}
or
if(count < 8){
// Do smth
}
Which is better for speed and performance and thanks.
They're exactly identical in their meaning (unless you pass a non-default memory order like count.load(std::memory_order_acquire)).
I'd expect there to be no difference in the generated assembly for all compilers across all ISAs, with optimization enabled of course. There isn't for GCC/clang/MSVC/ICC in code I've looked at on https://godbolt.org/. This is true regardless of surrounding code it's inlining into.
If there is ever a difference, and one is slower or takes more code-size, report that as a missed-optimization compiler bug in whatever compiler you're using. (Unless you had optimization disabled, then an extra level of calls to wrapper functions is possible.)
As for the error, that's because you're evaluating it in a context that doesn't already imply a type: as an operand for a variadic function (printf).
If there's enough context to imply that you want the underlying T value from an atomic<T> (which is what atomic_uint64_t is), then the operator T() overload is called, which is documented as being equivalent to .load(). Same deal for assignment and .store().
There aren't any other functions that let you access only the low 32 bits of an atomic 64-bit integer (unfortunately); even on a 32-bit machine, current compilers will actually go to the trouble of doing a 64-bit atomic load (which is efficient on some 32-bit machines, not on others), then discarding the high 32 if you cast the value to a narrower type. (This is a missed-optimization, but compilers truly don't optimize atomics for the moment.)
So there's no ambiguity being resolved by .load, or any way a cast can pick a different load.
One reason for the existence of .load() and .store() is that they take a std::memory_order parameter, which is defaulted to seq_cst but can be weaker if you just need atomicity but only acq/rel synchronization between threads. Or none at all with relaxed, just atomicity.
Another reason is to let you write foo.load() to remind readers of you code that this is an atomic variable, not just a plain primitive type. For that style reason I'd prefer count.load(). Presumably if you changed its type away from uint64_t, you'd want to change how you printed it, not still cast it to uint64_t. Using .load() will let the compiler warn you about the format-string mismatch if you change its type.

What formally guarantees that non-atomic variables can't see out-of-thin-air values and create a data race like atomic relaxed theoretically can?

This is a question about the formal guarantees of the C++ standard.
The standard points out that the rules for std::memory_order_relaxed atomic variables allow "out of thin air" / "out of the blue" values to appear.
But for non-atomic variables, can this example have UB? Is r1 == r2 == 42 possible in the C++ abstract machine? Neither variable == 42 initially so you'd expect neither if body should execute, meaning no writes to the shared variables.
// Global state
int x = 0, y = 0;
// Thread 1:
r1 = x;
if (r1 == 42) y = r1;
// Thread 2:
r2 = y;
if (r2 == 42) x = 42;
The above example is adapted from the standard, which explicitly says such behavior is allowed by the specification for atomic objects:
[Note: The requirements do allow r1 == r2 == 42 in the following
example, with x and y initially zero:
// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42) y.store(r1, memory_order_relaxed);
// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42) x.store(42, memory_order_relaxed);
However, implementations should not allow such behavior. – end note]
What part of the so called "memory model" protects non atomic objects from these interactions caused by reads seeing out-of-thin-air values?
When a race condition would exist with different values for x and y, what guarantees that read of a shared variable (normal, non atomic) cannot see such values?
Can not-executed if bodies create self-fulfilling conditions that lead to a data-race?
The text of your question seems to be missing the point of the example and out-of-thin-air values. Your example does not contain data-race UB. (It might if x or y were set to 42 before those threads ran, in which case all bets are off and the other answers citing data-race UB apply.)
There is no protection against real data races, only against out-of-thin-air values.
I think you're really asking how to reconcile that mo_relaxed example with sane and well-defined behaviour for non-atomic variables. That's what this answer covers.
The note is pointing out a hole in the atomic mo_relaxed formalism, not warning you of a real possible effect on some implementations.
This gap does not (I think) apply to non-atomic objects, only to mo_relaxed.
They say However, implementations should not allow such behavior. – end note]. Apparently the standards committee couldn't find a way to formalize that requirement so for now it's just a note, but is not intended to be optional.
It's clear that even though this isn't strictly normative, the C++ standard intends to disallow out-of-thin-air values for relaxed atomic (and in general I assume). Later standards discussion, e.g. 2018's p0668r5: Revising the C++ memory model (which doesn't "fix" this, it's an unrelated change) includes juicy side-nodes like:
We still do not have an acceptable way to make our informal (since C++14) prohibition of out-of-thin-air results precise. The primary practical effect of that is that formal verification of C++ programs using relaxed atomics remains unfeasible. The above paper suggests a solution similar to http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3710.html . We continue to ignore the problem here ...
So yes, the normative parts of the standard are apparently weaker for relaxed_atomic than they are for non-atomic. This seems to be an unfortunately side effect of how they define the rules.
AFAIK no implementations can produce out-of-thin-air values in real life.
Later versions of the standard phrase the informal recommendation more clearly, e.g. in the current draft: https://timsong-cpp.github.io/cppwp/atomics.order#8
Implementations should ensure that no “out-of-thin-air” values are computed that circularly depend on their own computation.
...
[ Note: The recommendation [of 8.] similarly disallows r1 == r2 == 42 in the following example, with x and y again initially zero:
// Thread 1:
r1 = x.load(memory_order::relaxed);
if (r1 == 42) y.store(42, memory_order::relaxed);
// Thread 2:
r2 = y.load(memory_order::relaxed);
if (r2 == 42) x.store(42, memory_order::relaxed);
— end note ]
(This rest of the answer was written before I was sure that the standard intended to disallow this for mo_relaxed, too.)
I'm pretty sure the C++ abstract machine does not allow r1 == r2 == 42.
Every possible ordering of operations in the C++ abstract machine operations leads to r1=r2=0 without UB, even without synchronization. Therefore the program has no UB and any non-zero result would violate the "as-if" rule.
Formally, ISO C++ allows an implementation to implement functions / programs in any way that gives the same result as the C++ abstract machine would. For multi-threaded code, an implementation can pick one possible abstract-machine ordering and decide that's the ordering that always happens. (e.g. when reordering relaxed atomic stores when compiling to asm for a strongly-ordered ISA. The standard as written even allows coalescing atomic stores but compilers choose not to). But the result of the program always has to be something the abstract machine could have produced. (Only the Atomics chapter introduces the possibility of one thread observing the actions of another thread without mutexes. Otherwise that's not possible without data-race UB).
I think the other answers didn't look carefully enough at this. (And neither did I when it was first posted). Code that doesn't execute doesn't cause UB (including data-race UB), and compilers aren't allowed to invent writes to objects. (Except in code paths that already unconditionally write them, like y = (x==42) ? 42 : y; which would obviously create data-race UB.)
For any non-atomic object, if don't actually write it then other threads might also be reading it, regardless of code inside not-executed if blocks. The standard allows this and doesn't allow a variable to suddenly read as a different value when the abstract machine hasn't written it. (And for objects we don't even read, like neighbouring array elements, another thread might even be writing them.)
Therefore we can't do anything that would let another thread temporarily see a different value for the object, or step on its write. Inventing writes to non-atomic objects is basically always a compiler bug; this is well known and universally agreed upon because it can break code that doesn't contain UB (and has done so in practice for a few cases of compiler bugs that created it, e.g. IA-64 GCC I think had such a bug at one point that broke the Linux kernel). IIRC, Herb Sutter mentioned such bugs in part 1 or 2 of his talk, atomic<> Weapons: The C++ Memory Model and Modern Hardware", saying that it was already usually considered a compiler bug before C++11, but C++11 codified that and made it easier to be sure.
Or another recent example with ICC for x86:
Crash with icc: can the compiler invent writes where none existed in the abstract machine?
In the C++ abstract machine, there's no way for execution to reach either y = r1; or x = r2;, regardless of sequencing or simultaneity of the loads for the branch conditions. x and y both read as 0 and neither thread ever writes them.
No synchronization is required to avoid UB because no order of abstract-machine operations leads to a data-race. The ISO C++ standard doesn't have anything to say about speculative execution or what happens when mis-speculation reaches code. That's because speculation is a feature of real implementations, not of the abstract machine. It's up to implementations (HW vendors and compiler writers) to ensure the "as-if" rule is respected.
It's legal in C++ to write code like if (global_id == mine) shared_var = 123; and have all threads execute it, as long as at most one thread actually runs the shared_var = 123; statement. (And as long as synchronization exists to avoid a data race on non-atomic int global_id). If things like this broke down, it would be chaos. For example, you could apparently draw wrong conclusions like reordering atomic operations in C++
Observing that a non-write didn't happen isn't data-race UB.
It's also not UB to run if(i<SIZE) return arr[i]; because the array access only happens if i is in bounds.
I think the "out of the blue" value-invention note only applies to relaxed-atomics, apparently as a special caveat for them in the Atomics chapter. (And even then, AFAIK it can't actually happen on any real C++ implementations, certainly not mainstream ones. At this point implementations don't have to take any special measures to make sure it can't happen for non-atomic variables.)
I'm not aware of any similar language outside the atomics chapter of the standard that allows an implementation to allow values to appear out of the blue like this.
I don't see any sane way to argue that the C++ abstract machine causes UB at any point when executing this, but seeing r1 == r2 == 42 would imply that unsynchronized read+write had happened, but that's data-race UB. If that can happen, can an implementation invent UB because of speculative execution (or some other reason)? The answer has to be "no" for the C++ standard to be usable at all.
For relaxed atomics, inventing the 42 out of nowhere wouldn't imply that UB had happened; perhaps that's why the standard says it's allowed by the rules? As far as I know, nothing outside the Atomics chapter of the standard allows it.
A hypothetical asm / hardware mechanism that could cause this
(Nobody wants this, hopefully everyone agrees that it would be a bad idea to build hardware like this. It seems unlikely that coupling speculation across logical cores would ever be worth the downside of having to roll back all cores when one detects a mispredict or other mis-speculation.)
For 42 to be possible, thread 1 has to see thread 2's speculative store and the store from thread 1 has to be seen by thread 2's load. (Confirming that branch speculation as good, allowing this path of execution to become the real path that was actually taken.)
i.e. speculation across threads: Possible on current HW if they ran on the same core with only a lightweight context switch, e.g. coroutines or green threads.
But on current HW, memory reordering between threads is impossible in that case. Out-of-order execution of code on the same core gives the illusion of everything happening in program order. To get memory reordering between threads, they need to be running on different cores.
So we'd need a design that coupled together speculation between two logical cores. Nobody does that because it means more state needs to rollback if a mispredict is detected. But it is hypothetically possible. For example an OoO SMT core that allows store-forwarding between its logical cores even before they've retired from the out-of-order core (i.e. become non-speculative).
PowerPC allows store-forwarding between logical cores for retired stores, meaning that threads can disagree about the global order of stores. But waiting until they "graduate" (i.e. retire) and become non-speculative means it doesn't tie together speculation on separate logical cores. So when one is recovering from a branch miss, the others can keep the back-end busy. If they all had to rollback on a mispredict on any logical core, that would defeat a significant part of the benefit of SMT.
I thought for a while I'd found an ordering that lead to this on single core of a real weakly-ordered CPUs (with user-space context switching between the threads), but the final step store can't forward to the first step load because this is program order and OoO exec preserves that.
T2: r2 = y; stalls (e.g. cache miss)
T2: branch prediction predicts that r2 == 42 will be true. ( x = 42 should run.
T2: x = 42 runs. (Still speculative; r2 = yhasn't obtained a value yet so ther2 == 42` compare/branch is still waiting to confirm that speculation).
a context switch to Thread 1 happens without rolling back the CPU to retirement state or otherwise waiting for speculation to be confirmed as good or detected as mis-speculation.
This part won't happen on real C++ implementations unless they use an M:N thread model, not the more common 1:1 C++ thread to OS thread. Real CPUs don't rename the privilege level: they don't take interrupts or otherwise enter the kernel with speculative instructions in flight that might need to rollback and redo entering kernel mode from a different architectural state.
T1: r1 = x; takes its value from the speculative x = 42 store
T1: r1 == 42 is found to be true. (Branch speculation happens here, too, not actually waiting for store-forwarding to complete. But along this path of execution, where the x = 42 did happen, this branch condition will execute and confirm the prediction).
T1: y = 42 runs.
this was all on the same CPU core so this y=42 store is after the r2=y load in program-order; it can't give that load a 42 to let the r2==42 speculation be confirmed. So this possible ordering doesn't demonstrate this in action after all. This is why threads have to be running on separate cores with inter-thread speculation for effects like this to be possible.
Note that x = 42 doesn't have a data dependency on r2 so value-prediction isn't required to make this happen. And the y=r1 is inside an if(r1 == 42) anyway so the compiler can optimize to y=42 if it wants, breaking the data dependency in the other thread and making things symmetric.
Note that the arguments about Green Threads or other context switch on a single core isn't actually relevant: we need separate cores for the memory reordering.
I commented earlier that I thought this might involve value-prediction. The ISO C++ standard's memory model is certainly weak enough to allow the kinds of crazy "reordering" that value-prediction can create to use, but it's not necessary for this reordering. y=r1 can be optimized to y=42, and the original code includes x=42 anyway so there's no data dependency of that store on the r2=y load. Speculative stores of 42 are easily possible without value prediction. (The problem is getting the other thread to see them!)
Speculating because of branch prediction instead of value prediction has the same effect here. And in both cases the loads need to eventually see 42 to confirm the speculation as correct.
Value-prediction doesn't even help make this reordering more plausible. We still need inter-thread speculation and memory reordering for the two speculative stores to confirm each other and bootstrap themselves into existence.
ISO C++ chooses to allow this for relaxed atomics, but AFAICT is disallows this non-atomic variables. I'm not sure I see exactly what in the standard does allow the relaxed-atomic case in ISO C++ beyond the note saying it's not explicitly disallowed. If there was any other code that did anything with x or y then maybe, but I think my argument does apply to the relaxed atomic case as well. No path through the source in the C++ abstract machine can produce it.
As I said, it's not possible in practice AFAIK on any real hardware (in asm), or in C++ on any real C++ implementation. It's more of an interesting thought-experiment into crazy consequences of very weak ordering rules, like C++'s relaxed-atomic. (Those ordering rules don't disallow it, but I think the as-if rule and the rest of the standard does, unless there's some provision that allows relaxed atomics to read a value that was never actually written by any thread.)
If there is such a rule, it would only be for relaxed atomics, not for non-atomic variables. Data-race UB is pretty much all the standard needs to say about non-atomic vars and memory ordering, but we don't have that.
When a race condition potentially exists, what guarantees that a read of a shared variable (normal, non atomic) cannot see a write
There is no such guarantee.
When race condition exists, the behaviour of the program is undefined:
[intro.races]
Two actions are potentially concurrent if
they are performed by different threads, or
they are unsequenced, at least one is performed by a signal handler, and they are not both performed by the same signal handler invocation.
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior. ...
The special case is not very relevant to the question, but I'll include it for completeness:
Two accesses to the same object of type volatile std::sig_­atomic_­t do not result in a data race if both occur in the same thread, even if one or more occurs in a signal handler. ...
What part of the so called "memory model" protects non atomic objects from these interactions caused by reads that see the interaction?
None. In fact, you get the opposite and the standard explicitly calls this out as undefined behavior. In [intro.races]\21 we have
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.
which covers your second example.
The rule is that if you have shared data in multiple threads, and at least one of those threads write to that shared data, then you need synchronization. Without that you have a data race and undefined behavior. Do note that volatile is not a valid synchronization mechanism. You need atomics/mutexs/condition variables to protect shared access.
Note: The specific examples I give here are apparently not accurate. I've assumed the optimizer can be somewhat more aggressive than it's apparently allowed to be. There is some excellent discussion about this in the comments. I'm going to have to investigate this further, but wanted to leave this note here as a warning.
Other people have given you answers quoting the appropriate parts of the standard that flat out state that the guarantee you think exists, doesn't. It appears that you're interpreting a part of the standard that says a certain weird behavior is permitted for atomic objects if you use memory_order_relaxed as meaning that this behavior is not permitted for non-atomic objects. This is a leap of inference that is explicitly addressed by other parts of the standard that declare the behavior undefined for non-atomic objects.
In practical terms, here is an order of events that might happen in thread 1 that would be perfectly reasonable, but result in the behavior you think is barred even if the hardware guaranteed that all memory access was completely serialized between CPUs. Keep in mind that the standard has to not only take into account the behavior of the hardware, but the behavior of optimizers, which often aggressively re-order and re-write code.
Thread 1 could be re-written by an optimizer to look this way:
old_y = y; // old_y is a hidden variable (perhaps a register) created by the optimizer
y = 42;
if (x != 42) y = old_y;
There might be perfectly reasonable reasons for an optimizer to do this. For example, it may decide that it's far more likely than not for 42 to be written into y, and for dependency reasons, the pipeline might work a lot better if the store into y occurs sooner rather than later.
The rule is that the apparent result must look as if the code you wrote is what was executed. But there is no requirement that the code you write bears any resemblance at all to what the CPU is actually told to do.
The atomic variables impose constraints on the ability of the compiler to re-write code as well as instructing the compiler to issue special CPU instructions that impose constraints on the ability of the CPU to re-order memory accesses. The constraints involving memory_order_relaxed are much stronger than what is ordinarily allowed. The compiler would generally be allowed to completely get rid of any reference to x and y at all if they weren't atomic.
Additionally, if they are atomic, the compiler must ensure that other CPUs see the entire variable as either with the new value or the old value. For example, if the variable is a 32-bit entity that crosses a cache line boundary and a modification involves changing bits on both sides of the cache line boundary, one CPU may see a value of the variable that is never written because it only sees an update to the bits on one side of the cache line boundary. But this is not allowed for atomic variables modified with memory_order_relaxed.
That is why data races are labeled as undefined behavior by the standard. The space of the possible things that could happen is probably a lot wilder than your imagination could account for, and certainly wider than any standard could reasonably encompass.
(Stackoverflow complains about too many comments I put above, so I gathered them into an answer with some modifications.)
The intercept you cite from from C++ standard working draft N3337 was wrong.
[Note: The requirements do allow r1 == r2 == 42 in the following
example, with x and y initially zero:
// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42)
y.store(r1, memory_order_relaxed);
// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42)
x.store(42, memory_order_relaxed);
A programming language should never allow this "r1 == r2 == 42" to happen.
This has nothing to do with memory model. This is required by causality, which is the basic logic methodology and the foundation of any programming language design. It is the fundamental contract between human and computer. Any memory model should abide by it. Otherwise it is a bug.
The causality here is reflected by the intra-thread dependences between operations within a thread, such as data dependence (e.g., read after write in same location) and control dependence (e.g., operation in a branch), etc. They cannot be violated by any language specification. Any compiler/processor design should respect the dependence in its committed result (i.e., externally visible result or program visible result).
Memory model is mainly about memory operation ordering among multi-processors, which should never violate the intra-thread dependence, although a weak model may allow the causality happening in one processor to be violated (or unseen) in another processor.
In your code snippet, both threads have (intra-thread) data dependence (load->check) and control dependence (check->store) that ensure their respective executions (within a thread) are ordered. That means, we can check the later op's output to determine if the earlier op has executed.
Then we can use simple logic to deduce that, if both r1 and r2 are 42, there must be a dependence cycle, which is impossible, unless you remove one condition check, which essentially breaks the dependence cycle. This has nothing to do with memory model, but intra-thread data dependence.
Causality (or more accurately, intra-thread dependence here) is defined in C++ std, but not so explicitly in early drafts, because dependence is more of micro-architecture and compiler terminology. In language spec, it is usually defined as operational semantics. For example, the control dependence formed by "if statement" is defined in the same version of draft you cited as "If the condition yields true the first substatement is executed. " That defines the sequential execution order.
That said, the compiler and processor can schedule one or more operations of the if-branch to be executed before the if-condition is resolved. But no matter how the compiler and processor schedule the operations, the result of the if-branch cannot be committed (i.e., become visible to the program) before the if-condition is resolved. One should distinguish between semantics requirement and implementation details. One is language spec, the other is how the compiler and processor implement the language spec.
Actually the current C++ standard draft has corrected this bug in https://timsong-cpp.github.io/cppwp/atomics.order#9 with a slight change.
[ Note: The recommendation similarly disallows r1 == r2 == 42 in the following example, with x and y again initially zero:
// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42)
y.store(42, memory_order_relaxed);
// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42)
x.store(42, memory_order_relaxed);

do integer reads need to be critical section protected?

I have come across C++03 some code that takes this form:
struct Foo {
int a;
int b;
CRITICAL_SECTION cs;
}
// DoFoo::Foo foo_;
void DoFoo::Foolish()
{
if( foo_.a == 4 )
{
PerformSomeTask();
EnterCriticalSection(&foo_.cs);
foo_.b = 7;
LeaveCriticalSection(&foo_.cs);
}
}
Does the read from foo_.a need to be protected? e.g.:
void DoFoo::Foolish()
{
EnterCriticalSection(&foo_.cs);
int a = foo_.a;
LeaveCriticalSection(&foo_.cs);
if( a == 4 )
{
PerformSomeTask();
EnterCriticalSection(&foo_.cs);
foo_.b = 7;
LeaveCriticalSection(&foo_.cs);
}
}
If so, why?
Please assume the integers are 32-bit aligned. The platform is ARM.
Technically yes, but no on many platforms. First, let us assume that int is 32 bits (which is pretty common, but not nearly universal).
It is possible that the two words (16 bit parts) of a 32 bit int will be read or written to separately. On some systems, they will be read separately if the int isn't aligned properly.
Imagine a system where you can only do 32-bit aligned 32 bit reads and writes (and 16-bit aligned 16 bit reads and writes), and an int that straddles such a boundary. Initially the int is zero (ie, 0x00000000)
One thread writes 0xBAADF00D to the int, the other reads it "at the same time".
The writing thread first writes 0xBAAD to the high word of the int. The reader thread then reads the entire int (both high and low) getting 0xBAAD0000 -- which is a state that the int was never put into on purpose!
The writer thread then writes the low word 0xF00D.
As noted, on some platforms all 32 bit reads/writes are atomic, so this isn't a concern. There are other concerns, however.
Most lock/unlock code includes instructions to the compiler to prevent reordering across the lock. Without that prevention of reordering, the compiler is free to reorder things so long as it behaves "as-if" in a single threaded context it would have worked that way. So if you read a then b in code, the compiler could read b before it reads a, so long as it doesn't see an in-thread opportunity for b to be modified in that interval.
So possibly the code you are reading is using these locks to make sure that the read of the variable happens in the order written in the code.
Other issues are raised in the comments below, but I don't feel competent to address them: cache issues, and visibility.
Looking at this it seems that arm has quite relaxed memory model so you need a form of memory barrier to ensure that writes in one thread are visible when you'd expect them in another thread. So what you are doing, or else using std::atomic seems likely necessary on your platform. Unless you take this into account you can see updates out of order in different threads which would break your example.
I think you can use C++11 to ensure that integer reads are atomic, using (for example) std::atomic<int>.
The C++ standard says that there's a data race if one thread writes to a variable at the same time as another thread reads from that variable, or if two threads write to the same variable at the same time. It further says that a data race produces undefined behavior. So, formally, you must synchronize those reads and writes.
There are three separate issues when one thread reads data that was written by another thread. First, there is tearing: if writing requires more than a single bus cycle, it's possible for a thread switch to occur in the middle of the operation, and another thread could see a half-written value; there's an analogous problem if a read requires more than a single bus cycle. Second, there's visibility: each processor has its own local copy of the data that it's been working on recently, and writing to one processor's cache does not necessarily update another processor's cache. Third, there's compiler optimizations that reorder reads and writes in ways that would be okay within a single thread, but will break multi-threaded code. Thread-safe code has to deal with all three problems. That's the job of synchronization primitives: mutexes, condition variables, and atomics.
Although the integer read/write operation indeed will most likely be atomic, the compiler optimizations and processor cache will still give you problems if you don't do it properly.
To explain - the compiler will normally assume that the code is single-threaded and make many optimizations that rely on that. For example, it might change the order of instructions. Or, if it sees that the variable is only written and never read, it might optimize it away entirely.
The CPU will also cache that integer, so if one thread writes it, the other one might not get to see it until a lot later.
There are two things you can do. One is to wrap in in critical section like in your original code. The other is to mark the variable as volatile. That will signal the compiler that this variable will be accessed by multiple threads and will disable a range of optimizations, as well as placing special cache-sync instructions (aka "memory barriers") around accesses to the variable (or so I understand). Apparently this is wrong.
Added: Also, as noted by another answer, Windows has Interlocked APIs that can be used to avoid these issues for non-volatile variables.

"pseudo-atomic" operations in C++

So I'm aware that nothing is atomic in C++. But I'm trying to figure out if there are any "pseudo-atomic" assumptions I can make. The reason is that I want to avoid using mutexes in some simple situations where I only need very weak guarantees.
1) Suppose I have globally defined volatile bool b, which
initially I set true. Then I launch a thread which executes a loop
while(b) doSomething();
Meanwhile, in another thread, I execute b=true.
Can I assume that the first thread will continue to execute? In other words, if b starts out as true, and the first thread checks the value of b at the same time as the second thread assigns b=true, can I assume that the first thread will read the value of b as true? Or is it possible that at some intermediate point of the assignment b=true, the value of b might be read as false?
2) Now suppose that b is initially false. Then the first thread executes
bool b1=b;
bool b2=b;
if(b1 && !b2) bad();
while the second thread executes b=true. Can I assume that bad() never gets called?
3) What about an int or other builtin types: suppose I have volatile int i, which is initially (say) 7, and then I assign i=7. Can I assume that, at any time during this operation, from any thread, the value of i will be equal to 7?
4) I have volatile int i=7, and then I execute i++ from some thread, and all other threads only read the value of i. Can I assume that i never has any value, in any thread, except for either 7 or 8?
5) I have volatile int i, from one thread I execute i=7, and from another I execute i=8. Afterwards, is i guaranteed to be either 7 or 8 (or whatever two values I have chosen to assign)?
There are no threads in standard C++, and Threads cannot be implemented as a library.
Therefore, the standard has nothing to say about the behaviour of programs which use threads. You must look to whatever additional guarantees are provided by your threading implementation.
That said, in threading implementations I've used:
(1) yes, you can assume that irrelevant values aren't written to variables. Otherwise the whole memory model goes out the window. But be careful that when you say "another thread" never sets b to false, that means anywhere, ever. If it does, that write could perhaps be re-ordered to occur during your loop.
(2) no, the compiler can re-order the assignments to b1 and b2, so it is possible for b1 to end up true and b2 false. In such a simple case I don't know why it would re-order, but in more complex cases there might be very good reasons.
[Edit: oops, by the time I got to answering (2) I'd forgotten that b was volatile. Reads from a volatile variable won't be re-ordered, sorry, so yes on a typical threading implementation (if there is any such thing), you can assume that you won't end up with b1 true and b2 false.]
(3) same as 1. volatile in general has nothing to do with threading at all. However, it is quite exciting in some implementations (Windows), and might in effect imply memory barriers.
(4) on an architecture where int writes are atomic yes, although volatile has nothing to do with it. See also...
(5) check the docs carefully. Likely yes, and again volatile is irrelevant, because on almost all architectures int writes are atomic. But if int write is not atomic, then no (and no for the previous question), even if it's volatile you could in principle get a different value. Given those values 7 and 8, though, we're talking a pretty weird architecture for the byte containing the relevant bits to be written in two stages, but with different values you could more plausibly get a partial write.
For a more plausible example, suppose that for some bizarre reason you have a 16 bit int on a platform where only 8bit writes are atomic. Odd, but legal, and since int must be at least 16 bits you can see how it could come about. Suppose further that your initial value is 255. Then increment could legally be implemented as:
read the old value
increment in a register
write the most significant byte of the result
write the least significant byte of the result.
A read-only thread which interrupted the incrementing thread between the third and fourth steps of that, could see the value 511. If the writes are in the other order, it could see 0.
An inconsistent value could be left behind permanently if one thread is writing 255, another thread is concurrently writing 256, and the writes get interleaved. Impossible on many architectures, but to know that this won't happen you need to know at least something about the architecture. Nothing in the C++ standard forbids it, because the C++ standard talks about execution being interrupted by a signal, but otherwise has no concept of execution being interrupted by another part of the program, and no concept of concurrent execution. That's why threads aren't just another library - adding threads fundamentally changes the C++ execution model. It requires the implementation to do things differently, as you'll eventually discover if for example you use threads under gcc and forget to specify -pthreads.
The same could happen on a platform where aligned int writes are atomic, but unaligned int writes are permitted and not atomic. For example IIRC on x86, unaligned int writes are not guaranteed atomic if they cross a cache line boundary. x86 compilers will not mis-align a declared int variable, for this reason and others. But if you play games with structure packing you could probably provoke an example.
So: pretty much any implementation will give you the guarantees you need, but might do so in quite a complicated way.
In general, I've found that it is not worth trying to rely on platform-specific guarantees about memory access, that I don't fully understand, in order to avoid mutexes. Use a mutex, and if that's too slow use a high-quality lock-free structure (or implement a design for one) written by someone who really knows the architecture and compiler. It will probably be correct, and subject to correctness will probably outperform anything I invent myself.
Most of the answers correctly address the CPU memory ordering issues you're going to experience, but none have detailed how the compiler can thwart your intentions by re-ordering your code in ways that break your assumptions.
Consider an example taken from this post:
volatile int ready;
int message[100];
void foo(int i)
{
message[i/10] = 42;
ready = 1;
}
At -O2 and above, recent versions of GCC and Intel C/C++ (don't know about VC++) will do the store to ready first, so it can be overlapped with computation of i/10 (volatile does not save you!):
leaq _message(%rip), %rax
movl $1, _ready(%rip) ; <-- whoa Nelly!
movq %rsp, %rbp
sarl $2, %edx
subl %edi, %edx
movslq %edx,%rdx
movl $42, (%rax,%rdx,4)
This isn't a bug, it's the optimizer exploiting CPU pipelining. If another thread is waiting on ready before accessing the contents of message then you have a nasty and obscure race.
Employ compiler barriers to ensure your intent is honored. An example that also exploits the relatively strong ordering of x86 are the release/consume wrappers found in Dmitriy Vyukov's Single-Producer Single-Consumer queue posted here:
// load with 'consume' (data-dependent) memory ordering
// NOTE: x86 specific, other platforms may need additional memory barriers
template<typename T>
T load_consume(T const* addr)
{
T v = *const_cast<T const volatile*>(addr);
__asm__ __volatile__ ("" ::: "memory"); // compiler barrier
return v;
}
// store with 'release' memory ordering
// NOTE: x86 specific, other platforms may need additional memory barriers
template<typename T>
void store_release(T* addr, T v)
{
__asm__ __volatile__ ("" ::: "memory"); // compiler barrier
*const_cast<T volatile*>(addr) = v;
}
I suggest that if you are going to venture into the realm of concurrent memory access, use a library that will take care of these details for you. While we all wait for n2145 and std::atomic check out Thread Building Blocks' tbb::atomic or the upcoming boost::atomic.
Besides correctness, these libraries can simplify your code and clarify your intent:
// thread 1
std::atomic<int> foo; // or tbb::atomic, boost::atomic, etc
foo.store(1, std::memory_order_release);
// thread 2
int tmp = foo.load(std::memory_order_acquire);
Using explicit memory ordering, foo's inter-thread relationship is clear.
May be this thread is ancient, but the C++ 11 standard DOES have a thread library and also a vast atomic library for atomic operations. The purpose is specifically for concurrency support and avoid data races.
The relevant header is atomic
It's generally a really, really bad idea to depend on this, as you could end up with bad things happening and only one some architectures. The best solution would be to use a guaranteed atomic API, for example the Windows Interlocked api.
If your C++ implementation supplies the library of atomic operations specified by n2145 or some variant thereof, you can presumably rely on it. Otherwise, you cannot in general rely on "anything" about atomicity at the language level, since multitasking of any kind (and therefore atomicity, which deals with multitasking) is not specified by the existing C++ standard.
Volatile in C++ do not plays the same role than in Java. All the cases are undefined behavior as Steve saids. Some cases can be Ok for a compiler, oa given processor architecture and with a multi-threading system, but switching the optimization flags can make your program behave differently, as the C++03 compilers don't know about threads.
C++0x defines the rules that avoid race conditions and the operations that help you to master that, but to may knowledge there is no yet a compiler that implement yet all the part of the standard related to this subject.
My answer is going to be frustrating: No, No, No, No, and No.
1-4) The compiler is allowed to do ANYTHING it pleases with a variable it writes to. It may store temporary values in it, so long as ends up doing something that would do the same thing as that thread executing in a vacuum. ANYTHING is valid
5) Nope, no guarantee. If a variable is not atomic, and you write to it on one thread, and read or write to it on another, it is a race case. The spec declares such race cases to be undefined behavior, and absolutely anything goes. That being said, you will be hard pressed to find a compiler that does not give you 7 or 8, but it IS legal for a compiler to give you something else.
I always refer to this highly comical explanation of race cases.
http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong

Are C++ Reads and Writes of an int Atomic?

I have two threads, one updating an int and one reading it. This is a statistic value where the order of the reads and writes is irrelevant.
My question is, do I need to synchronize access to this multi-byte value anyway? Or, put another way, can part of the write be complete and get interrupted, and then the read happen.
For example, think of a value = 0x0000FFFF that gets incremented value of 0x00010000.
Is there a time where the value looks like 0x0001FFFF that I should be worried about? Certainly the larger the type, the more possible something like this to happen.
I've always synchronized these types of accesses, but was curious what the community thinks.
Boy, what a question. The answer to which is:
Yes, no, hmmm, well, it depends
It all comes down to the architecture of the system. On an IA32 a correctly aligned address will be an atomic operation. Unaligned writes might be atomic, it depends on the caching system in use. If the memory lies within a single L1 cache line then it is atomic, otherwise it's not. The width of the bus between the CPU and RAM can affect the atomic nature: a correctly aligned 16bit write on an 8086 was atomic whereas the same write on an 8088 wasn't because the 8088 only had an 8 bit bus whereas the 8086 had a 16 bit bus.
Also, if you're using C/C++ don't forget to mark the shared value as volatile, otherwise the optimiser will think the variable is never updated in one of your threads.
At first one might think that reads and writes of the native machine size are atomic but there are a number of issues to deal with including cache coherency between processors/cores. Use atomic operations like Interlocked* on Windows and the equivalent on Linux. C++0x will have an "atomic" template to wrap these in a nice and cross-platform interface. For now if you are using a platform abstraction layer it may provide these functions. ACE does, see the class template ACE_Atomic_Op.
IF you're reading/writing 4-byte value AND it is DWORD-aligned in memory AND you're running on the I32 architecture, THEN reads and writes are atomic.
Yes, you need to synchronize accesses. In C++0x it will be a data race, and undefined behaviour. With POSIX threads it's already undefined behaviour.
In practice, you might get bad values if the data type is larger than the native word size. Also, another thread might never see the value written due to optimizations moving the read and/or write.
You must synchronize, but on certain architectures there are efficient ways to do it.
Best is to use subroutines (perhaps masked behind macros) so that you can conditionally replace implementations with platform-specific ones.
The Linux kernel already has some of this code.
On Windows, Interlocked***Exchange***Add is guaranteed to be atomic.
To echo what everyone said upstairs, the language pre-C++0x cannot guarantee anything about shared memory access from multiple threads. Any guarantees would be up to the compiler.
No, they aren't (or at least you can't assume they are). Having said that, there are some tricks to do this atomically, but they typically aren't portable (see Compare-and-swap).
I agree with many and especially Jason. On windows, one would likely use InterlockedAdd and its friends.
Asside from the cache issue mentioned above...
If you port the code to a processor with a smaller register size it will not be atomic anymore.
IMO, threading issues are too thorny to risk it.
Lets take this example
int x;
x++;
x=x+5;
The first statement is assumed to be atomic because it translates to a single INC assembly directive that takes a single CPU cycle. However, the second assignment requires several operations so it's clearly not an atomic operation.
Another e.g,
x=5;
Again, you have to disassemble the code to see what exactly happens here.
tc,
I think the moment you use a constant ( like 6) , the instruction wouldn't be completed in one machine cycle.
Try to see the instruction set of x+=6 as compared to x++
Some people think that ++c is atomic, but have a eye on the assembly generated. For example with 'gcc -S' :
movl cpt.1586(%rip), %eax
addl $1, %eax
movl %eax, cpt.1586(%rip)
To increment an int, the compiler first load it into a register, and stores it back into the memory. This is not atomic.
Definitively NO !
That answer from our highest C++ authority, M. Boost:
Operations on "ordinary" variables are not guaranteed to be atomic.
The only portable way is to use the sig_atomic_t type defined in signal.h header for your compiler. In most C and C++ implementations, that is an int. Then declare your variable as "volatile sig_atomic_t."
Reads and writes are atomic, but you also need to worry about the compiler re-ordering your code. Compiler optimizations may violate happens-before relationship of statements in your code. By using atomic you don't have to worry about that.
...
atomic i;
soap_status = GOT_RESPONSE ;
i = 1
In the above example, the variable 'i' will only be set to 1 after we get a soap response.