Why doesn't the C/C++ compiler always make ++a atomic?

As the title says: when we write ++a in C/C++, it seems the compiler may compile it as:
inc dword ptr [a]
which is atomic,
or as:
mov eax, dword ptr [a]
inc eax
mov dword ptr [a], eax
which is not atomic.
Is there any advantage to compile it as non-atomic style?

What if your code looks like this?
++a;
if (a > 1) {
    ...
}
If the compiler uses the first representation, it accesses memory to increment a, then it accesses memory again to compare to 1. In the second case, it accesses memory to get the value once and puts it in eax. Then it simply compares the register eax against 1, which is significantly quicker.
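To make that concrete, here's a sketch of the combined pattern, with the kind of code-gen described above shown as comments (the register choice is an assumption):

int a = 0;   // assumed global, as in the question

void bump_and_check() {
    ++a;            // mov eax, [a] / inc eax / mov [a], eax
    if (a > 1) {    // cmp eax, 1  -- reuses the copy already in eax, no second load
        // ...
    }
}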

First, you seem to have a very particular family of processors in mind. Not all have an instruction that acts directly on memory.
Even if they do, a single instruction of that kind can be a very complex and costly thing. If it is really atomic as you claim, it has to stop all other bus transfers. This slows computation down to the speed of the memory bus, which is usually orders of magnitude slower than the CPU.

The non-atomic variant ends up with the value in a register, ready for later use.

Related

C/C++: why can a simple integer addition on a volatile be translated to different asm instructions by gcc and clang?

I wrote a simple loop:
int volatile value = 0;

void loop(int limit) {
    for (int i = 0; i < limit; ++i) {
        ++value;
    }
}
I compiled this with gcc and clang (-O3 -fno-unroll-loops) and got different outputs. They differ in the ++value part:
clang:
add dword ptr [rip + value], 1 # ++value
add edi, -1 # --limit
jne .LBB0_1 # if limit > 0 then continue looping
gcc:
mov eax, DWORD PTR value[rip] # copy value to a register
add edx, 1 # ++i
add eax, 1 # increment a copy of value
mov DWORD PTR value[rip], eax # store incremented copy to value, i. e. ++value
cmp edi, edx # compare i < limit
jne .L3 # if i < limit then continue looping
The C and C++ versions are the same on each compiler (https://gcc.godbolt.org/z/5x5jGP).
So my questions are:
1) Is gcc doing something wrong? What is the point of copying the value?
2) I have benchmarked that code and for some reason the profiler shows that in gcc's version 73% of the time is spent on the instruction add edx, 1, 13% on mov DWORD PTR value[rip], eax and 13% on cmp edi, edx. Am I interpreting these results wrong? Why do the other addition and move instructions take less than 1% of the time?
3) Why can performance differ on gcc/clang in such a primitive code?
This is all because you used volatile, and GCC doesn't optimize volatile accesses as aggressively as clang does.
Without volatile, e.g. for a single ++*int_ptr, you get a memory-destination add. (And hopefully not inc when tuning for Intel CPUs; inc reg is fine but inc mem costs an extra uop vs. add 1. Unfortunately gcc and clang both get this wrong and use inc mem with -march=skylake: https://godbolt.org/z/_1Ri20)
clang knows that it can fold the volatile read / write accesses into the load and store portions of a memory-destination add.
GCC does not know how to do this optimization for volatile. Using volatile in GCC typically results in separate mov loads and stores, avoiding x86's ability to save code-size by using CISC memory operands for ALU instructions. On a load/store machine (like any RISC) you'd need separate load and store instructions anyway, so it would be a non-issue.
TL:DR: different compiler internals around volatile, specifically a GCC missed-optimization.
This missed optimization barely matters because volatile is rarely used, but feel free to report it on GCC's bugzilla if you want.
Without volatile, the loop would of course optimize away. But you can see a single memory-destination add from GCC or clang for a function that just does ++*p.
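For example, a minimal non-volatile function like the following (just a sketch) compiles to a single memory-destination add on both compilers:

void incr(int *p) {
    ++*p;        // gcc/clang: add dword ptr [rdi], 1
}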
1) Is gcc doing something wrong? What is the point of copying the value?
It's only copying it to a register. We don't normally call this "copying", just bringing it into a register where it can operate on it.
Note that gcc and clang also differ in how they implement the loop condition, with clang optimizing to just dec/jnz (actually add -1, but it would use dec with -march=skylake or something with efficient dec, i.e. not Silvermont).
GCC spends an extra uop on the loop condition (on Intel CPUs where add/jnz can macro-fuse into a single uop). IDK why it compiles it naively like that.
73% of time is wasted on instruction add edx, 1
perf counters typically blame the instruction that's waiting for a slow result, not the instruction that's actually slow to produce it.
add edx,1 is waiting for the reload of value. With 4 to 5 cycle store-forwarding latency, this is the major bottleneck in your loop.
(Whether it's between the multiple uops of a memory-destination add or between separate instructions makes essentially no difference. There are no other memory accesses in your loop so none of the weird effects of store-forwarding latency being lower if you don't try too soon come into play:
Adding a redundant assignment speeds up code when compiled without optimization or Loop with function call faster than an empty loop )
Why other addition and move instructions take less than 1% of the time?
Because out-of-order execution hides them under the latency of the critical path. They are very rarely the instruction that gets blamed when statistical sampling has to pick one out of the many that are in flight at once in any given cycle.
3) Why can performance differ on gcc/clang in such a primitive code?
I'd expect both those loops run at the same speed. Did you just mean performance as in how well the compilers themselves performed in making code that's both fast and compact?

C/C++: relaxed std::atomic<bool> vs unlocked bool on X64 architecture

Is there any efficiency benefit to using an unlocked boolean over using an std::atomic<bool> where the operations are always done with relaxed memory order? I would assume that both eventually compile to the same machine code, since a single byte is actually atomic on X64 hardware. Am I wrong?
Yes, there are potentially massive advantages, especially for local variables, or any variable used repeatedly in the same function. An atomic<> variable can't be optimized into a register.
If you compiled without optimization, the code-gen would be similar, but compiling with normal optimization enabled there can be massive differences. Un-optimized code is similar to making every variable volatile.
Current compilers also never combine multiple reads of an atomic variable into one, as if you'd used volatile atomic<T>, because that's what people expect and the dust hasn't settled yet on how to allow useful optimizations while prohibiting ones you don't want. (Why don't compilers merge redundant std::atomic writes? and Can and does the compiler optimize out two atomic loads?).
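As a small illustration of that point (a sketch, with made-up names), current compilers emit two separate loads here even though merging them would be legal:

#include <atomic>

std::atomic<int> shared_counter{0};

int read_twice() {
    // Two relaxed loads of the same atomic: compilers currently keep both,
    // rather than folding them into one load as the as-if rule would allow.
    int a = shared_counter.load(std::memory_order_relaxed);
    int b = shared_counter.load(std::memory_order_relaxed);
    return a + b;
}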
The following isn't a great example, but imagine that checking the boolean is done inside an inlined function, and that there's something else inside the loop. (Otherwise you'd put the if around the loop like a normal person.)
std::atomic<bool> atomic_bool;   // assumed: a global flag (requires <atomic>)

int sumarr_atomic(int arr[]) {
    int sum = 0;
    for (int i = 0; i < 10000; i++) {
        if (atomic_bool.load(std::memory_order_relaxed)) {
            sum += arr[i];
        }
    }
    return sum;
}
See the asm output on Godbolt.
But with a non-atomic bool, the compiler can make that transformation for you by hoisting the load, and then auto-vectorize the simple sum loop (or not run it at all).
With atomic_bool, it can't. With atomic_bool, the asm loop is much like the C++ source, actually doing a test and branch on the value of the variable inside every loop iteration. And this of course defeats auto-vectorization.
(The C++ as-if rules would actually allow the compiler to hoist the load, because a relaxed load can reorder with non-atomic accesses, and to merge repeated loads, because reading the same value every time is one possible legal outcome. But as I said, compilers don't do that.)
Loops over an array of bool can auto-vectorize, but not over atomic<bool> [].
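For comparison, here is a sketch of the non-atomic counterpart (same shape, plain bool): the compiler is free to hoist the flag check out of the loop and vectorize the sum:

bool plain_bool;   // assumed global, analogous to atomic_bool above

int sumarr_plain(int arr[]) {
    int sum = 0;
    for (int i = 0; i < 10000; i++) {
        if (plain_bool) {        // may be hoisted to a single check before the loop
            sum += arr[i];
        }
    }
    return sum;
}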
Also, inverting a boolean with something like b ^= 1; or b++ can be just a regular RMW, not atomic RMW, so it doesn't have to use lock xor or lock btc. (x86 atomic RMW is only possible with sequential-consistency vs. runtime reordering, i.e. the lock prefix is also a full memory barrier.)
Code that modifies a non-atomic boolean can optimize away the actual modifications, e.g.
bool regular_bool;   // assumed: a global, as in the asm below

void loop() {
    for (int i = 0; i < 10000; i++) {
        regular_bool ^= 1;
    }
}
compiles to asm that keeps regular_bool in a register. Unfortunately it doesn't optimize away to nothing (which it could because flipping a boolean an even number of times sets it back to its original value). But it could with a smarter compiler.
loop():
movzx edx, BYTE PTR regular_bool[rip] # load into a register
mov eax, 10000
.L17: # do {
xor edx, 1 # flip the boolean
sub eax, 1
jne .L17 # } while(--i);
mov BYTE PTR regular_bool[rip], dl # store back the result
ret
Even if written as atomic_b.store( !atomic_b.load(mo_relaxed), mo_relaxed) (separate atomic loads/stores), you'd still get a store/reload in the loop, creating a 6-cycle loop-carried dependency chain through the store/reload (on Intel CPUs with 5-cycle store-forwarding latency) instead of a 1-cycle dep chain through a register.
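A sketch of that separate-load/store version (names assumed), which still has the store/reload dependency chain through memory:

#include <atomic>

std::atomic<bool> atomic_b{false};

void flip_loop() {
    for (int i = 0; i < 10000; i++) {
        // Relaxed load + relaxed store: still a round trip through memory
        // every iteration, so the loop is bound by store-forwarding latency.
        atomic_b.store(!atomic_b.load(std::memory_order_relaxed),
                       std::memory_order_relaxed);
    }
}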
Checking over at Godbolt, loading a regular bool and a std::atomic<bool> generate different code, although not because of synchronisation issues. Instead, the compiler (gcc) seems unwilling to assume that a std::atomic<bool> is guaranteed to be either 0 or 1. Strange, that.
Clang does the same thing, although the code generated is slightly different in detail.

VC++, /volatile:ms on x86

The documentation on volatile says:
When the /volatile:ms compiler option is used—by default when architectures other than ARM are targeted—the compiler generates extra code to maintain ordering among references to volatile objects in addition to maintaining ordering to references to other global objects.
What exact code could compile differently with /volatile:ms and /volatile:iso?
A complete understanding of this requires a bit of a history lesson. (And who doesn't like history?…says the guy who majored in history.) The /volatile:ms semantics were first added to the compiler with Visual Studio 2005. Starting with that version, variables marked volatile automatically imposed acquire semantics on reads, and release semantics on writes, through that variable.
What does this mean? It has to do with the memory model, and specifically, how aggressively the compiler is permitted to reorder memory-access operations. An operation that has acquire semantics prevents subsequent memory operations from being hoisted above it; an operation that has release semantics prevents preceding memory operations from being delayed until after it. As the names suggest, acquire semantics are typically used when you are acquiring a resource, whereas release semantics are typically used when you are releasing a resource. MSDN has a more complete description of acquire and release semantics; it says:
An operation has acquire semantics if other processors will always see its effect before any subsequent operation's effect. An operation has release semantics if other processors will see every preceding operation's effect before the effect of the operation itself. Consider the following code example:
a++;
b++;
c++;
From another processor's point of view, the preceding operations can appear to occur in any order. For example, the other processor might see the increment of b before the increment of a.
For example, the InterlockedIncrementAcquire routine uses acquire semantics to increment a variable. If you rewrote the preceding code example as follows:
InterlockedIncrementAcquire(&a);
b++;
c++;
other processors would always see the increment of a before the increments of b and c.
Likewise, the InterlockedIncrementRelease routine uses release semantics to increment a variable. If you rewrote the code example once again, as follows:
a++;
b++;
InterlockedIncrementRelease(&c);
other processors would always see the increments of a and b before the increment of c.
Now, like MSDN says, atomic operations have both acquire and release semantics. And, in fact, on x86, there is no way to give an instruction only acquire or release semantics, so achieving even one of these requires that the instruction is made atomic (which the compiler will generally do by emitting a LOCK CMPXCHG instruction).
Before Visual Studio 2005's enhancement of the volatile semantics, developers who wanted to write correct code needed to use the Interlocked* family of functions, as described in the MSDN article. Unfortunately, many developers failed to do this and got code that worked mostly by accident (or didn't work at all). But there was a good chance that it did work by accident, given the x86's relatively strict memory model. You often get the semantics you want for free since, on x86, most loads and stores already have acquire/release semantics, so you don't even need to make anything atomic. (Non-temporal stores are the obvious exception, but in this case, those wouldn't matter anyway.) I suspect this ease of implementation on x86 is what, combined with the realization that programmers generally failed to understand and do the right thing, persuaded Microsoft to strengthen the semantics of volatile in VS 2005.
Another potential reason for the change was the growing importance of multi-threaded code. 2005 was around the time that Pentium 4 chips with HyperThreading were beginning to become popular, effectively bringing simultaneous multi-threading to every user's desktop. Probably not coincidentally, VS 2005 also removed the option to link to the single-threaded version of the C run-time libraries. It is when you have multi-threaded code—with the possibility of executing on multiple processors—that you really have to start worrying about getting the memory-access semantics correct.
With VS 2005 and later, you could just mark a pointer parameter as volatile and get the desired acquire semantics. The volatility implied/imposed the acquire semantics, which made multi-threaded code running in multi-processing environments safe. Prior to 2011, this was extremely important, since the C and C++ language standards had absolutely nothing to say about threading and gave you no portable way of writing correct code.
And this brings us right to the answer to your question. If your code assumes these extended semantics for volatile, then you need to pass the /volatile:ms switch to ensure that the compiler continues to apply them. If you have written C++11-style code that uses modern primitives for atomic, thread-safe operations, then you don't need volatile to have these extended semantics and are safe passing /volatile:iso. In other words, as manni66 quipped, if your code "misuses volatile as std::atomic", then you will see a difference in behavior and need /volatile:ms to guarantee that volatile does have the same effect as std::atomic.
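Here's a hedged sketch of the kind of code that "misuses volatile as std::atomic" and therefore depends on /volatile:ms (all names made up):

volatile bool g_ready = false;   // used as if it were std::atomic<bool>
int g_payload = 0;

void producer() {
    g_payload = 42;
    g_ready = true;        // /volatile:ms: release semantics; /volatile:iso: no ordering promise
}

int consumer() {
    while (!g_ready) { }   // /volatile:ms: acquire semantics; /volatile:iso: no ordering promise
    return g_payload;      // may observe stale data under /volatile:iso
}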
As it turns out, it has proven very difficult for me to find an example of a case where /volatile:iso actually changes the generated code, as compared to /volatile:ms. Microsoft's optimizer is actually very conservative with respect to reordering instructions, which is the type of thing that the acquire/release semantics are supposed to protect against.
Here's a simple example (where we're using a volatile global variable to guard a critical section, as you might find in a simplistic "lock-free" implementation) that should demonstrate the difference:
volatile bool CriticalSection;
int Data[100];
void FillData(int i)
{
    Data[i] = 42;             // fill data item at index 'i'
    CriticalSection = false;  // release critical section
}
If you compile this with GCC at -O2, it will generate the following machine code:
FillData(int):
mov eax, DWORD PTR [esp+4] // retrieve parameter 'i' from stack
mov BYTE PTR [CriticalSection], 0 // store '0' in 'CriticalSection'
mov DWORD PTR [Data+eax*4], 42 // store '42' at index 'i' in 'Data'
ret
Even if you aren't fluent in assembly language, you should be able to see that the optimizer has re-ordered the stores, such that the critical section is released (CriticalSection = false) before the data is filled in (Data[i] = 42)—precisely the opposite of the order in which the statements appeared in the original C code. The volatile is having no effect on this re-ordering, because GCC follows the ISO semantics, just like /volatile:iso will (in theory).
By the way, notice how…um…volatile :-) this ordering is. If we compile at -O1 in GCC, we get instructions that do everything in the same order as our original C code:
FillData(int):
mov eax, DWORD PTR [esp+4] // retrieve parameter 'i' from stack
mov DWORD PTR [Data+eax*4], 42 // store '42' at index 'i' in 'Data'
mov BYTE PTR [CriticalSection], 0 // store '0' in 'CriticalSection'
ret
When you start throwing more instructions in there for the compiler to rearrange, and especially if this code were to get inlined, you can imagine how unlikely it is that the original order is preserved.
But, like I said, MSVC is actually very conservative with regard to re-ordering instructions. Regardless of whether I specify /volatile:ms or /volatile:iso, I get exactly the same machine code:
FillData, COMDAT PROC
mov eax, DWORD PTR [esp+4]
mov DWORD PTR [Data+eax*4], 42
mov BYTE PTR [CriticalSection], 0
ret
FillData ENDP
where the stores are done in the original order. I've played with all sorts of different permutations, introducing additional variables and operations, all without being able to find the magic sequence that causes MSVC to re-order the stores. So, it is very likely that, currently, in practice, you won't see a very big difference with the /volatile:iso switch set when targeting x86 architectures. But that's a very loose guarantee, to say the least.
Note that this empirical observation is consistent with Alexander Gutenev's speculation that a difference in semantics is observed only on ARM, and that the whole reason these switches were introduced was to avoid paying a performance penalty on this newly-supported platform. Meanwhile, on the x86 side, there have been no actual changes to the semantics in generated code, since there is essentially no cost. (Save for some extremely trivial optimization possibilities, but that would require that their optimizer have two completely separate schedulers, which probably isn't a good use of developer time.)
The point is that, with /volatile:iso, MSVC is permitted to act like GCC and re-order the stores. With /volatile:ms, you are guaranteed that it won't because volatile implies acquire/release semantics for that variable.
Bonus Reading: So, what is volatile supposed to be used for, in strictly ISO-compliant code (i.e., when the /volatile:iso switch is used)? Well, volatile is basically meant for memory-mapped I/O. That's what it was originally meant for when it was first introduced, and it remains its principal purpose. I've heard it jokingly said that volatile is for reading/writing a tape drive. Basically, you mark the pointer volatile in order to prevent the compiler from optimizing reads and writes away. For example:
volatile char* pDeviceIOAddr = ...;

void Wait()
{
    while (*pDeviceIOAddr)
    { }
}
Qualifying the pointed-to type with volatile prevents the compiler from assuming that subsequent reads through the pointer return the same value, forcing it to do a new read each time through the loop. In other words:
mov eax, DWORD PTR [pDeviceIoAddr] // get pointer
Wait:
cmp BYTE PTR [eax], 0 // dereference pointer, read 1 byte,
jnz Wait // and compare to 0
If pDeviceIoAddr wasn't volatile, the entire loop could have been elided. Optimizers definitely do this in practice, including MSVC. Or, you could get the following pathological code:
mov eax, DWORD PTR [pDeviceIoAddr] // get pointer
mov al, BYTE PTR [eax] // dereference pointer, read 1 byte
Wait:
cmp al, 0 // compare it to 0
jnz Wait
where the pointer is dereferenced once, outside of the loop, caching the byte in a register. The instruction at the top of the loop just tests that enregistered value, creating either no loop or an infinite loop. Oops.
Notice, however, that this use of volatile in ISO-standard C++ does not obviate the need for critical sections, mutexes, or other types of locks. Even the correct version of the above code wouldn't work correctly if another thread could potentially modify pDeviceIOAddr, since the read of that address/pointer does not have acquire semantics. Acquire semantics would look like this:
Wait:
mov eax, DWORD PTR [pDeviceIoAddr] // get pointer (acquire semantics)
cmp BYTE PTR [eax], 0 // dereference pointer, read 1 byte,
jnz Wait // and compare to 0
and to get that, you would need C++11's std::atomic.
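A sketch of what that might look like with C++11 atomics (assuming the pointer itself is what other threads update, as in the asm above):

#include <atomic>

std::atomic<volatile char*> pDeviceIOAddr{nullptr};   // assumed to be set elsewhere before Wait() runs

void Wait()
{
    // The acquire load of the pointer orders the subsequent dereference after it.
    while (*pDeviceIOAddr.load(std::memory_order_acquire))
    { }
}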
I suspect it might start to matter at some point, if it doesn't already.
There's an undocumented option /volatileMetadata- vs. /volatileMetadata. When it is on (the default), some metadata is generated for emulating x86 on ARM.
The information is from my issue report, some links from there:
Visual Studio 2019 version 16.10 Preview 2 enabled volatile metadata by default when targeting x64 to improve emulation performance
ARM64EC Support in Visual Studio
So, at least hypothetically, /volatile:iso may matter for volatile metadata, and affect x86 code execution on ARM.
I have no confirmation that this happens. But by compiling the same binary with and without /volatileMetadata-, I can at least confirm from the binary size that the mentioned metadata does exist.

How does memory barrier work?

Under Windows, there are three compiler-intrinsic functions to implement memory barrier:
1. _ReadBarrier;
2. _WriteBarrier;
3. _ReadWriteBarrier;
However, I found a weird problem: _ReadBarrier seems to be a dummy function that does nothing! The following is the assembly code generated by VC++ 2012.
My question is: How to implement a memory barrier function in assembly instructions?
int main()
{
013EEE10 push ebp
013EEE11 mov ebp,esp
013EEE13 sub esp,0CCh
013EEE19 push ebx
013EEE1A push esi
013EEE1B push edi
013EEE1C lea edi,[ebp-0CCh]
013EEE22 mov ecx,33h
013EEE27 mov eax,0CCCCCCCCh
013EEE2C rep stos dword ptr es:[edi]
int n = 0;
013EEE2E mov dword ptr [n],0
n = n + 1;
013EEE35 mov eax,dword ptr [n]
013EEE38 add eax,1
013EEE3B mov dword ptr [n],eax
_ReadBarrier();
n = n + 1;
013EEE3E mov eax,dword ptr [n]
013EEE41 add eax,1
013EEE44 mov dword ptr [n],eax
}
013EEE56 xor eax,eax
013EEE58 pop edi
013EEE59 pop esi
013EEE5A pop ebx
013EEE5B add esp,0CCh
013EEE61 cmp ebp,esp
013EEE63 call __RTC_CheckEsp (013EC3B0h)
013EEE68 mov esp,ebp
013EEE6A pop ebp
013EEE6B ret
_ReadBarrier, _WriteBarrier, and _ReadWriteBarrier are intrinsics that affect how the compiler can reorder code; they have absolutely nothing to do with CPU memory barriers and are only valid for specific kinds of memory (see "Affected Memory" here).
MemoryBarrier() is the intrinsic that you use to force a CPU memory barrier. However, the recommendation from Microsoft is to use std::atomic<T> going forward with VC++.
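For reference, here is a sketch of the std::atomic approach Microsoft recommends instead of raw barriers (names are made up):

#include <atomic>

std::atomic<bool> ready{false};
int payload = 0;

void producer() {
    payload = 42;                                  // plain write
    ready.store(true, std::memory_order_release);  // publish: earlier writes become visible first
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) { }  // waits, then orders later reads after the load
    return payload;                                     // sees 42 once ready is true
}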
Modern processors are capable of executing instructions quite a long way ahead of where they are actually "completing" them, so memory barriers are used to prevent them from running too far ahead when it comes to certain types of memory operations where strict ordering is required - for most things, it doesn't actually matter whether you write to variable a before variable b, or b before a. But sometimes it does.
The x86 instruction set has lfence, sfence and mfence, which are instructions that "fence in" loads, stores and all memory operations respectively. The point of a "fence" or "barrier" instruction is to ensure that all the instructions that precede the barrier have completed their loads, stores or both before the next instruction after the barrier can continue.
This is important if you are implementing, for example, semaphores, mutexes or similar constructs, since it's important to store the value saying "I've locked the semaphore" before you continue to read other data. Otherwise things can go wrong, let's say.
Note that unless you REALLY know what you are doing with memory barriers, it's probably best to NOT use them - and rely on already existing code that solves the same problem - std::atomic is one place to find such code. I have written quite a bit of "tricky" code, but only once or twice have I needed a memory barrier in my code.
Several times, I've needed to make the compiler not spread the code around, which you can do with "no-op functions", and apparently there are even special intrinsic functions these days to do that.
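If all you need is a compiler-only barrier (no fence instruction emitted), one standard facility commonly used for that is atomic_signal_fence; a sketch, with made-up globals:

#include <atomic>

int a_value, b_value;   // made-up globals

void ordered_writes() {
    a_value = 1;
    std::atomic_signal_fence(std::memory_order_seq_cst);  // in practice a compiler-only barrier; emits no fence instruction
    b_value = 2;
}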
There are several important points to consider. Perhaps the first is that barriers only have an effect in multithreaded code, and most compilers require a special option to produce multithreaded code. And things like _ReadBarrier are almost certainly compiler built-ins, and should do nothing unless you've given the options for multithreaded code.
The second is that what the hardware requires, even in a multithreaded context, varies. On most of the machines I've worked on (over some forty years), the machine never required anything; barriers only become relevant if the machine has sophisticated read and write pipelines. (Most earlier machines didn't even have fence or barrier instructions, so the generated code would have to be empty.)

Reading interlocked variables

Assume:
A. C++ under WIN32.
B. A properly aligned volatile integer incremented and decremented using InterlockedIncrement() and InterlockedDecrement().
__declspec (align(8)) volatile LONG _ServerState = 0;
If I want to simply read _ServerState, do I need to read the variable via an InterlockedXXX function?
For instance, I have seen code such as:
LONG x = InterlockedExchange(&_ServerState, _ServerState);
and
LONG x = InterlockedCompareExchange(&_ServerState, _ServerState, _ServerState);
The goal is to simply read the current value of _ServerState.
Can't I simply say:
if (_ServerState == some value)
{
// blah blah blah
}
There seems to be some confusion WRT this subject. I understand register-sized reads are atomic in Windows, so I would assume the InterlockedXXX function is unnecessary.
Matt J.
Okay, thanks for the responses. BTW, this is Visual C++ 2005 and 2008.
If it's true I should use an InterlockedXXX function to read the value of _ServerState, even if just for the sake of clarity, what's the best way to go about that?
LONG x = InterlockedExchange(&_ServerState, _ServerState);
This has the side effect of modifying the value, when all I really want to do is read it. Not only that, but there is a possibility that I could reset the flag to the wrong value if there is a context switch as the value of _ServerState is pushed on the stack in preparation for calling InterlockedExchange().
LONG x = InterlockedCompareExchange(&_ServerState, _ServerState, _ServerState);
I took this from an example I saw on MSDN.
See http://msdn.microsoft.com/en-us/library/ms686355(VS.85).aspx
All I need is something along the lines:
lock mov eax, [_ServerState]
In any case, the point, which I thought was clear, is to provide thread-safe access to a flag without incurring the overhead of a critical section. I have seen LONGs used this way via the InterlockedXXX() family of functions, hence my question.
Okay, we are thinking a good solution to this problem of reading the current value is:
LONG Cur = InterlockedCompareExchange(&_ServerState, 0, 0);
It depends on what you mean by "goal is to simply read the current value of _ServerState" and it depends on what set of tools and the platform you use (you specify Win32 and C++, but not which C++ compiler, and that may matter).
If you simply want to read the value such that the value is uncorrupted (i.e., if some other processor is changing the value from 0x12345678 to 0x87654321, your read will get one of those 2 values and not 0x12344321), then simply reading will be OK as long as the variable is:
marked volatile,
properly aligned, and
read using a single instruction with a word size that the processor handles atomically
None of this is promised by the C/C++ standard, but Windows and MSVC do make these guarantees, and I think that most compilers that target Win32 do as well.
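A sketch of a read that only needs to be tear-free under those conditions (the declaration is borrowed from the question):

#include <windows.h>

__declspec(align(8)) volatile LONG _ServerState = 0;

LONG PeekServerState()
{
    return _ServerState;   // single aligned 32-bit load; no Interlocked call needed just for tear-freedom
}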
However, if you want your read to be synchronized with behavior of the other thread, there's some additional complexity. Say that you have a simple 'mailbox' protocol:
struct mailbox_struct {
    uint32_t flag;
    uint32_t data;
};
typedef struct mailbox_struct volatile mailbox;

// the global - initialized before either thread starts
mailbox mbox = { 0, 0 };

//***************************
// Thread A
while (mbox.flag == 0) {
    /* spin... */
}
uint32_t data = mbox.data;
//***************************
//***************************
// Thread B
mbox.data = some_very_important_value;
mbox.flag = 1;
//***************************
The thinking is that Thread A will spin waiting for mbox.flag to indicate that mbox.data has a valid piece of information. Thread B will write some data into mbox.data and then set mbox.flag to 1 as a signal that mbox.data is valid.
In this case a simple read in Thread A of mbox.flag might get the value 1 even though a subsequent read of mbox.data in Thread A does not get the value written by Thread B.
This is because even though the compiler will not reorder the Thread B writes to mbox.data and mbox.flag, the processor and/or cache might. C/C++ guarantees that the compiler will generate code such that Thread B will write to mbox.data before it writes to mbox.flag, but the processor and cache might have a different idea - special handling called 'memory barriers' or 'acquire and release semantics' must be used to ensure ordering below the level of the thread's stream of instructions.
I'm not sure if compilers other than MSVC make any claims about ordering below the instruction level. However MS does guarantee that for MSVC volatile is enough - MS specifies that volatile writes have release semantics and volatile reads have acquire semantics - though I'm not sure at which version of MSVC this applies - see http://msdn.microsoft.com/en-us/library/12a04hfd.aspx?ppud=4.
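For completeness, here is a portable sketch of the same mailbox using C++11 acquire/release (rather than relying on MSVC's volatile semantics):

#include <atomic>
#include <cstdint>

struct mailbox {
    std::atomic<uint32_t> flag{0};
    uint32_t data{0};
};
mailbox mbox;

// Thread B (writer)
void post(uint32_t some_very_important_value) {
    mbox.data = some_very_important_value;
    mbox.flag.store(1, std::memory_order_release);   // publishes data before flag becomes 1
}

// Thread A (reader)
uint32_t wait_for_data() {
    while (mbox.flag.load(std::memory_order_acquire) == 0) {
        /* spin... */
    }
    return mbox.data;                                // guaranteed to see the value written by Thread B
}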
I have also seen code like you describe that uses Interlocked APIs to perform simple reads and writes to shared locations. My take on the matter is to use the Interlocked APIs. Lock free inter-thread communication is full of very difficult to understand and subtle pitfalls, and trying to take a shortcut on a critical bit of code that may end up with a very difficult to diagnose bug doesn't seem like a good idea to me. Also, using an Interlocked API screams to anyone maintaining the code, "this is data access that needs to be shared or synchronized with something else - tread carefully!".
Also when using the Interlocked API you're taking the specifics of the hardware and the compiler out of the picture - the platform makes sure all of that stuff is dealt with properly - no more wondering...
Read Herb Sutter's Effective Concurrency articles on DDJ (which happen to be down at the moment, for me at least) for good information on this topic.
Your way is good:
LONG Cur = InterlockedCompareExchange(&_ServerState, 0, 0);
I'm using similar solution:
LONG Cur = InterlockedExchangeAdd(&_ServerState, 0);
Interlocked instructions provide atomicity and inter-processor synchronization. Both writes and reads must be synchronized, so yes, you should be using interlocked instructions to read a value that is shared between threads and not protected by a lock. Lock-free programming (and that's what you're doing) is a very tricky area, so you might consider using locks instead. Unless this is really one of your program's bottlenecks that must be optimized?
32-bit read operations are already atomic on some 32-bit systems (the Intel spec says these operations are atomic, but there's no guarantee that this will be true on other x86-compatible platforms). So you shouldn't rely on this for thread synchronization.
If you need a flag of some sort, you should consider using an Event object and the WaitForSingleObject function for that purpose.
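A minimal sketch of that approach (error handling omitted; names are made up):

#include <windows.h>

HANDLE hFlag = CreateEvent(nullptr, TRUE /*manual-reset*/, FALSE /*initially nonsignaled*/, nullptr);

// Thread that sets the flag:
void RaiseFlag() {
    SetEvent(hFlag);
}

// Thread that waits on it:
void WaitForFlag() {
    WaitForSingleObject(hFlag, INFINITE);
}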
To anyone who has to revisit this thread: I want to add to what was well explained by Bartosz that _InterlockedCompareExchange() is a good alternative to standard atomic_load() if standard atomics are not available. Here is the code for atomically reading my_uint32_t_var on x64 Windows. atomic_load() is included as a baseline:
long debug_x64_i = std::atomic_load((const std::_Atomic_long *)&my_uint32_t_var);
00000001401A6955 mov eax,dword ptr [rbp+30h]
00000001401A6958 xor edi,edi
00000001401A695A mov dword ptr [rbp-0Ch],eax
debug_x64_i = _InterlockedCompareExchange((long*)&my_uint32_t_var, 0, 0);
00000001401A695D xor eax,eax
00000001401A695F lock cmpxchg dword ptr [rbp+30h],edi
00000001401A6964 mov dword ptr [rbp-0Ch],eax
debug_x64_i = _InterlockedOr((long*)&my_uint32_t_var, 0);
00000001401A6967 prefetchw [rbp+30h]
00000001401A696B mov eax,dword ptr [rbp+30h]
00000001401A696E xchg ax,ax
00000001401A6970 mov ecx,eax
00000001401A6972 lock cmpxchg dword ptr [rbp+30h],ecx
00000001401A6977 jne foo+30h (01401A6970h)
00000001401A6979 mov dword ptr [rbp-0Ch],eax
long release_x64_i = std::atomic_load((const std::_Atomic_long *)&my_uint32_t_var);
00000001401A6955 mov eax,dword ptr [rbp+30h]
release_x64_i = _InterlockedCompareExchange((long*)&my_uint32_t_var, 0, 0);
00000001401A6958 mov dword ptr [rbp-0Ch],eax
00000001401A695B xor edi,edi
00000001401A695D mov eax,dword ptr [rbp-0Ch]
00000001401A6960 xor eax,eax
00000001401A6962 lock cmpxchg dword ptr [rbp+30h],edi
00000001401A6967 mov dword ptr [rbp-0Ch],eax
release_x64_i = _InterlockedOr((long*)&my_uint32_t_var, 0);
00000001401A696A prefetchw [rbp+30h]
00000001401A696E mov eax,dword ptr [rbp+30h]
00000001401A6971 mov ecx,eax
00000001401A6973 lock cmpxchg dword ptr [rbp+30h],ecx
00000001401A6978 jne foo+31h (01401A6971h)
00000001401A697A mov dword ptr [rbp-0Ch],eax
you should be okay. It's volatile, so the optimizer shouldn't savage you, and it's a 32-bit value so it should be at least approximately atomic. The one possible surprise is if the instruction pipeline can get around that.
On the other hand, what's the additional cost of using the guarded routines?
Simply reading the current value may not need any lock.
The Interlocked* functions prevent two different processors from accessing the same piece of memory. In a single processor system you are going to be ok. If you have a dual-core system where you have threads on different cores both accessing this value, you might have problems doing what you think is atomic without the Interlocked*.
Your initial understanding is basically correct. According to the memory model which Windows requires on all MP platforms it supports (or ever will support), reads from a naturally-aligned variable marked volatile are atomic as long as they are no larger than a machine word. Same with writes. You don't need a 'lock' prefix.
If you do the reads without using an interlock, you are subject to processor reordering. This can even occur on x86, in a limited circumstance: reads from a variable may be moved above writes of a different variable. On pretty much every non-x86 architecture that Windows supports, you are subject to even more complicated reordering if you don't use explicit interlocks.
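A classic sketch of that x86 reordering (Dekker-style; names made up): without an interlocked operation or full barrier, both threads can read 0:

volatile long X = 0, Y = 0;
long r1, r2;

void Thread1() { X = 1; r1 = Y; }   // the read of Y may be performed before the store to X is visible
void Thread2() { Y = 1; r2 = X; }   // likewise, so r1 == 0 && r2 == 0 is possible on x86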
There's also a requirement that if you're using a compare exchange loop, you must mark the variable you're compare exchanging on as volatile. Here's a code example to demonstrate why:
long g_var = 0;  // not marked 'volatile' -- this is an error

bool foo() {
    long oldValue;
    long newValue;
    long retValue;

    // (1) Capture the original global value
    oldValue = g_var;

    // (2) Compute a new value based on the old value
    newValue = SomeTransformation(oldValue);

    // (3) Store the new value if the global value is equal to old?
    retValue = InterlockedCompareExchange(&g_var,
                                          newValue,
                                          oldValue);

    if (retValue == oldValue) {
        return true;
    }
    return false;
}
What can go wrong is that the compiler is well within its rights to re-fetch oldValue from g_var at any time if it's not volatile. This 'rematerialization' optimization is great in many cases because it can avoid spilling registers to the stack when register pressure is high.
Thus, step (3) of the function would become:
// (3) Incorrectly store the new value regardless of whether the global
//     is equal to old.
retValue = InterlockedCompareExchange(&g_var,
                                      newValue,
                                      g_var);
Read is fine. A 32-bit value is always read as a whole as long as it's not split on a cache line. Your align 8 guarantees that it's always within a cache line so you'll be fine.
Forget about instruction reordering and all that nonsense. Results are always retired in order. It would be a processor recall otherwise!!!
Even for a dual CPU machine (i.e. shared via the slowest FSBs), you'll still be fine as the CPUs guarantee cache coherency via MESI Protocol. The only thing you're not guaranteed is the value you read may not be the absolute latest. BUT, what is the latest anyway? That's something you likely won't need to know in most situations if you're not writing back to the location based on the value of that read. Otherwise, you'd have used interlocked ops to handle it in the first place.
In short, you gain nothing by using Interlocked ops on a read (except perhaps reminding the next person maintaining your code to tread carefully - then again, that person may not be qualified to maintain your code to begin with).
EDIT: In response to a comment left by Adrian McCarthy.
You're overlooking the effect of compiler optimizations. If the compiler thinks it has the value already in a register, then it's going to re-use that value instead of re-reading it from memory. Also, the compiler may do instruction re-ordering for optimization if it believes there are no observable side effects.
I did not say reading from a non-volatile variable is fine. All the question asked was whether Interlocked was required. In fact, the variable in question was clearly declared volatile. Or were you overlooking the effect of the keyword volatile?