Reporting a thread progress to main thread in C++ - c++

In C/C++ How can I make the threads(POSIX pthreads/Windows threads) to give me a safe method to pass progress back to the main thread on the progress of the execution or my work that I’ve decided to perform with the thread.
Is it possible to report the progress in terms of percentage ?

I'm going to assume a very simple case of a main thread, and one function. What I'd recommend is passing in a pointer to an atomic (as suggested by Kirill above) for each time you launch the thread. Assuming C++11 here.
using namespace std;
void threadedFunction(atomic<int>* progress)
{
for(int i = 0; i < 100; i++)
{
progress->store(i); // updates the variable safely
chrono::milliseconds dura( 2000 );
this_thread::sleep_for(dura); // Sleeps for a bit
}
return;
}
int main(int argc, char** argv)
{
// Make and launch 10 threads
vector<atomic<int>> atomics;
vector<thread> threads;
for(int i = 0; i < 10; i++)
{
atomics.emplace_back(0);
threads.emplace_back(threadedFunction, &atomics[i]);
}
// Monitor the threads down here
// use atomics[n].load() to get the value from the atomics
return 0;
}
I think that'll do what you want. I omitted polling the threads, but you get the idea. I'm passing in an object that both the main thread and the child thread know about (in this case the atomic<int> variable) that they both can update and/or poll for results. If you're not on a full C++11 thread/atomic support compiler, use whatever your platform determines, but there's always a way to pass a variable (at the least a void*) into the thread function. And that's how you get something to pass information back and forth via non-statics.

The best way to solve this is to use C++ atomics for that. Declare in some visible enough place:
std::atomic<int> my_thread_progress(0);
In a simple case this should be a static variable, in a more complex place this should be a data field of some object that manages threads or something similar.
On many platforms this will be slightly paranoiac because almost everywhere the read and write operations on integers are atomic. Bit using atomics still it makes because:
You will have guarantee that this will work fine on any platform, even on a 16 bit CPU or whatever unusual hardware;
Your code will be easier to read. Reader will immediately see that this is shared variable without placing any comments. Once it will be updated with load/store methods, it will be easier to catch on what is going on.
EDIT
Intel® 64 and IA-32 Architectures Software Developer’s Manual
Combined Volumes: 1, 2A, 2B, 2C, 3A, 3B and 3C (http://download.intel.com/products/processor/manual/325462.pdf)
Volume 3A: 8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:
Reading or writing a byte
Reading or writing a word aligned on a 16-bit boundary
Reading or writing a doubleword aligned on a 32-bit boundary

Related

Is 11 a valid output under ISO c++ for x86_64,arm or other arch?

This question is based on Can't relaxed atomic fetch_add reorder with later loads on x86, like store can?
I agree with answer given.
On x86 00 will never occur because a.fetch_add has a lock prefix/full barrier and loads can't reorder above fetch_add but on other architectures like arm/mips it can print 00. I have a two followup question about store buffer on x86 and arm.
I never get 11 on my pc (core i3 x86_64) i.e is 11 a valid output on
x86 in iso c++ , so am i missing something ? #Daniel Langr demonstrated 11 is a valid output on x86.
Now x86_64 has an advantage fetch_add acting as a full barrier.
For arm64 , output can be 00 sometimes due to cpu instruction reordering.
For arm64 or some other arch, can the output be 00 if without reordering ?. My question
is based on this. The store buffer values for function foo
a.fetch_add(1) is not visible to bar's a.load() and b.fetch_add(1) is
not visible to foo's b.load(). Hence we get 00 without reordering. Can this happen under ISO C++ on different archs ?
// g++ -O2 -pthread axbx.cpp ; while [ true ]; do ./a.out | grep "00" ; done
#include<cstdio>
#include<thread>
#include<atomic>
using namespace std;
atomic<int> a,b;
int reta,retb;
void foo(){
a.fetch_add(1,memory_order_relaxed); //add to a is stored in store buffer of cpu0
//a.store(1,memory_order_relaxed);
retb=b.load(memory_order_relaxed);
}
void bar(){
b.fetch_add(1,memory_order_relaxed); //add to b is stored in store buffer of cpu1
//b.store(1,memory_order_relaxed);
reta=a.load(memory_order_relaxed);
}
int main(){
thread t[2]{ thread(foo),thread(bar) };
t[0].join(); t[1].join();
printf("%d%d\n",reta,retb);
return 0;
}
Yes, ISO C++ allows this; as Daniel pointed out, one easy way is to put some slow stuff after the RMWs, before the loads, so they don't execute until both threads have had a chance to increment. This should be obvious because it doesn't require any run-time reordering to happen, just a simple interleaving of program-order. (So ISO C++ allows 11 even with seq_cst.) i.e. you could have exercised this by single-stepping each thread separately.
Are you wondering about how to create a practical demonstration on x86 without delay loops?
Try putting your atomic vars in separate cache lines so two different cores can be writing them in parallel.
alignas(64) std::atomic<int> a, b; // the alignas applies to each separately
With them in the same cache line, like probably happens by default, the core that won ownership of the cache line they're both in is going to be able to execute the load of the other var as soon as the full-barrier part of the increment is done. Completing the RMW means that cache line is already hot in this core's L1d cache. (A core can only modify a cache line after gaining exclusive ownership of it via MESI, which will also make it valid for reads.)
So both operations by one thread are extremely likely to happen before either operation by the other thread. (In the x86 asm, each operation has the same asm as its seq_cst equivalent would have had, so we can usefully talk about a global order of operations without losing anything.)
Probably the only thing that would stop this from happening is an interrupt arriving at just the right moment, between the RMW and the load.
You also asked a separate question:
can the output be 00 if without reordering ?
Clearly no. No interleaving of program-order can put both loads before either increment, so either run-time or compile-time reordering is necessary to create the 00 effect.
a.fetch_add(1,memory_order_relaxed); // foo1
retb=b.load(memory_order_relaxed); // foo2
b.fetch_add(1,memory_order_relaxed); // bar1
reta=a.load(memory_order_relaxed); // bar2
Mix those however you want without putting foo2 before foo1 or bar2 before bar1.
i.e. if you single-stepped each thread separately, you could never see 00. Of course the whole point of mo_relaxed is that it can reorder. Specifying "without reordering" is the same as saying "with seq_cst".
The effects of a store buffer are a kind of reordering, specifically Store Load reordering. mo_seq_cst prevents even that, which is part of the point of seq_cst and what makes it so expensive, especially for pure-store operations.

How can I show that volatile assignment is not atomic?

I understand that assignment may not be atomic in C++.
I am trying to trigger a race condition to show this.
However my code below does not seem to trigger any such.
How can I change it so that it will eventually trigger a race condition?
#include <iostream>
#include <thread>
volatile uint64_t sharedValue = 1;
const uint64_t value1 = 13;
const uint64_t value2 = 1414;
void write() {
bool even = true;
for (;;) {
uint64_t value;
if (even)
value = value1;
else
value = value2;
sharedValue = value;
even = !even;
}
}
void read() {
for (;;) {
uint64_t value = sharedValue;
if (value != value1 && value != value2) {
std::cout << "Race condition! Value: " << value << std::endl << std::flush;
}
}
}
int main()
{
std::thread t1(write);
std::thread t2(read);
t1.join();
}
I am using VS 2017 and compiling in Release x86.
Here is the disassembly of the assignment:
sharedValue = value;
00D54AF2 mov eax,dword ptr [ebp-18h]
00D54AF5 mov dword ptr [sharedValue (0D5F000h)],eax
00D54AFA mov ecx,dword ptr [ebp-14h]
00D54AFD mov dword ptr ds:[0D5F004h],ecx
I guess this means that the assignment is not atomic?
Seems 32 of the bits are copied into the general purpose 32 bit register eax and the other 32 bits are copied into another general purpose 32 bit register ecx before being copied into sharedValue which is in the data segment register?
I also tried with uint32_t and all data was copied in one go.
So I guess that on x86 there is no need to use std::atomic for 32 bit data types?
Some answers/comments suggested sleeping in the writer. This is not useful; hammering away on the cache line changing it as often as possible is what you want. (And what you get with volatile assignments and reads.) An assignment will be torn when a MESI share request for the cache line arrives at the writer core between committing two halves of a store from the store buffer to L1d cache.
If you sleep, you're waiting a long time without creating a window for that to happen. Sleeping between halves would make it even more easy to detect, but you can't do that unless you use separate memcpy to write halves of the 64-bit integer or something.
Tearing between reads in the reader is also possible even if writes are atomic. This may be less likely, but still happens plenty in practice. Modern x86 CPUs can execute two loads per clock cycle (Intel since Sandybridge, AMD since K8). I tested with atomic 64-bit stores but split 32-bit loads on Skylake and tearing is still frequent enough to spew lines of text in a terminal. So the CPU didn't manage to run everything in lock-step with corresponding pairs of reads always executing in the same clock cycle. So there is a window for the reader getting its cache line invalidated between a pair of loads. (However, all the pending cache-miss loads while the cache line is owned by the writer core probably complete all at once when the cache line arrives. And the total number of load buffers available is an even number in existing microarchitectures.)
As you discovered, your test values both had the same upper half of 0, so this made it impossible to observe any tearing; only the 32-bit aligned low half was ever changing, and was changing atomically because your compiler guarantees at least 4-byte alignment for uint64_t, and x86 guarantees that 4-byte aligned loads/stores are atomic.
0 and -1ULL are the obvious choices. I used the same thing in a test-case for this GCC C11 _Atomic bug for a 64-bit struct.
For your case, I'd do this. read() and write() are POSIX system call names so I picked something else.
#include <cstdint>
volatile uint64_t sharedValue = 0; // initializer = one of the 2 values!
void writer() {
for (;;) {
sharedValue = 0;
sharedValue = -1ULL; // unrolling is vastly simpler than an if
}
}
void reader() {
for (;;) {
uint64_t val = sharedValue;
uint32_t low = val, high = val>>32;
if (low != high) {
std::cout << "Tearing! Value: " << std::hex << val << '\n';
}
}
}
MSVC 19.24 -O2 compiles the writer to using a movlpd 64-bit store for the = 0, but two separate 32-bit stores of -1 for the = -1. (And the reader to two separate 32-bit loads). GCC uses a total of four mov dword ptr [mem], imm32 stores in the writer, like you'd expect. (Godbolt compiler explorer)
Terminology: it's always a race condition (even with atomicity you don't know which of the two values you're going to get). With std::atomic<> you'd only have that garden-variety race condition, no Undefined Behaviour.
The question is whether you actually see tearing from the data race Undefined Behaviour on the volatile object, on a specific C++ implementation / set of compile options, for a specific platform. Data race UB is a technical term with a more specific meaning than "race condition". I changed the error message to report the one symptom we check for. Note that data-race UB on a non-volatile object can have way weirder effects, like hosting the load or store out of loops, or even inventing extra reads leading to code that thinks one read was both true and false at the same time. (https://lwn.net/Articles/793253/)
I removed 2 redundant cout flushes: one from std::endl and one from std::flush. cout is line-buffered by default, or full buffered if writing to a file, which is fine. And '\n' is just as portable as std::endl as far as DOS line endings are concerned; text vs. binary stream mode handles that. endl is still just \n.
I simplified your check for tearing by checking that high_half == low_half. Then the compiler just has to emit one cmp/jcc instead of two extended-precision compares to see if the value is either 0 or -1 exactly. We know that there's no plausible way for false-negatives like high = low = 0xff00ff00 to happen on x86 (or any other mainstream ISA with any sane compiler).
So I guess that on x86 there is no need to use std::atomic for 32 bit data types?
Incorrect.
Hand-rolled atomics with volatile int can't give you atomic RMW operations (without inline asm or special functions like Windows InterlockedIncrement or GNU C builtin __atomic_fetch_add), and can't give you any ordering guarantees wrt. other code. (Release / acquire semantics)
When to use volatile with multi threading? - pretty much never.
Rolling your own atomics with volatile is still possible and de-facto supported by many mainstream compilers (e.g. the Linux kernel still does that, along with inline asm). Real-world compilers do effectively define the behaviour of data races on volatile objects. But it's generally a bad idea when there's a portable and guaranteed-safe way. Just use std::atomic<T> with std::memory_order_relaxed to get asm that's just as efficient as what you could get with volatile (for the cases where volatile works), but with guarantees of safety and correctness from the ISO C++ standard.
atomic<T> also lets you ask the implementation whether a given type can be cheaply atomic or not, with C++17 std::atomic<T>::is_always_lock_free or the older member function. (In practice C++11 implementations decided not to let some but not all instances of any given atomic be lock free based on alignment or something; instead they just give atomic the required alignas if there is one. So C++17 made a constant per-type constant instead of per-object member function way to check lock-freedom).
std::atomic can also give cheap lock-free atomicity for types wider than a normal register. e.g. on ARM, using ARMv6 strd / ldrd to store/load a pair of registers.
On 32-bit x86, a good compiler can implement std::atomic<uint64_t> by using SSE2 movq to do atomic 64-bit loads and stores, without falling back to the non-lock_free mechanism (a table of locks). In practice, GCC and clang9 do use movq for atomic<uint64_t> load/store. clang8.0 and earlier uses lock cmpxchg8b unfortunately. MSVC uses lock cmpxchg8b in an even more inefficient way. Change the definition of sharedVariable in the Godbolt link to see it. (Or if you use one each of default seq_cst and memory_order_relaxed stores in the loop, MSVC for some reason calls a ?store#?$_Atomic_storage#_K$07#std##QAEX_KW4memory_order#2##Z helper function for one of them. But when both stores are the same ordering, it inlines lock cmpxchg8b with much clunkier loops than clang8.0) Note that this inefficient MSVC code-gen is for a case where volatile wasn't atomic; in cases where it is, atomic<T> with mo_relaxed compiles nicely, too.
You generally can't get that wide-atomic code-gen from volatile. Although GCC does actually use movq for your if() bool write function (see the earlier Godbolt compiler explorer link) because it can't see through the alternating or something. It also depends on what values you use. With 0 and -1 it uses separate 32-bit stores, but with 0 and 0x0f0f0f0f0f0f0f0fULL you get movq for a usable pattern. (I used this to verify that you can still get tearing from just the read side, instead of hand-writing some asm.) My simple unrolled version compiles to just use plain mov dword [mem], imm32 stores with GCC. This is a good example of there being zero guarantee of how volatile really compiles in this level of detail.
atomic<uint64_t> will also guarantee 8-byte alignment for the atomic object, even if plain uint64_t might have only been 4-byte aligned.
In ISO C++, a data race on a volatile object is still Undefined Behaviour. (Except for volatile sig_atomic_t racing with a signal handler.)
A "data race" is any time two unsynchronized accesses happen and they're not both reads. ISO C++ allows for the possibility of running on machines with hardware race detection or something; in practice no mainstream systems do that so the result is just tearing if the volatile object isn't "naturally atomic".
ISO C++ also in theory allows for running on machines that don't have coherent shared memory and require manual flushes after atomic stores, but that's not really plausible in practice. No real-world implementations are like that, AFAIK. Systems with cores that have non-coherent shared memory (like some ARM SoCs with DSP cores + microcontroller cores) don't start std::thread across those cores.
See also Why is integer assignment on a naturally aligned variable atomic on x86?
It's still UB even if you don't observe tearing in practice, although as I said real compilers do de-facto define the behaviour of volatile.
Skylake experiments to try to detect store-buffer coalescing
I wondered if store coalescing in the store buffer could maybe create an atomic 64-bit commit to L1d cache out of two separate 32-bit stores. (No useful results so far, leaving this here in case anyone's interested or wants to build on it.)
I used a GNU C __atomic builtin for the reader, so if the stores also ended up being atomic we'd see no tearing.
void reader() {
for (;;) {
uint64_t val = __atomic_load_n(&sharedValue, __ATOMIC_ACQUIRE);
uint32_t low = val, high = val>>32;
if (low != high) {
std::cout << "Tearing! Value: " << std::hex << val << '\n';
}
}
}
This was one attempt get the microachitecture to group the stores.
void writer() {
volatile int separator; // in a different cache line, has to commit separately
for (;;) {
sharedValue = 0;
_mm_mfence();
separator = 1234;
_mm_mfence();
sharedValue = -1ULL; // unrolling is vastly simpler than an if
_mm_mfence();
separator = 1234;
_mm_mfence();
}
}
I still see tearing with this. (mfence on Skylake with updated microcode is like lfence, and blocks out-of-order exec as well as draining the store buffer. So later stores shouldn't even enter the store buffer before the later ones leave. That might actually be a problem, because we need time for merging, not just committing a 32-bit store as soon as it "graduates" when the store uops retire).
Probably I should try to measure the rate of tearing and see if it's less frequent with anything, because any tearing at all is enough to spam a terminal window with text on a 4GHz machine.
Grab the disassembly and then check the documentation for your architecture; on some machines you'll find even standard "non-atomic" operations (in terms of C++) are actually atomic when it hits the hardware (in terms of assembly).
With that said, your compiler will know what is and isn't safe and it is therefore a better idea to use the std::atomic template to make your code more portable across architectures. If you're on a platform that doesn't require anything special, it'll typically get optimised down to a primitive type anyway (putting memory ordering aside).
I don't remember the details of x86 operations off hand, but I would guess that you have a data race if the 64 bit integer is written in 32 bit "chunks" (or less); it is possible to get a torn read that case.
There are also tools called thread sanitizer to catch it in the act. I don't believe they're supported on Windows with MSVC, but if you can get GCC or clang working then you might have some luck there. If your code is portable (it looks it), then you can run it on a Linux system (or VM) using these tools.
I changed the code to:
volatile uint64_t sharedValue = 0;
const uint64_t value1 = 0;
const uint64_t value2 = ULLONG_MAX;
and now the code triggers the race condition in less than a second.
The problem was that both 13 and 1414 has the 32 MSB = 0.
13=0xd
1414=0x586
0=0x0
ULLONG_MAX=0xffffffffffffffff
First, your code has a data race condition when reading and writing to sharedValue variable without any synchronization which is Undefined Behavior in C++. This can be fixed by making sharedValue an atomic variable:
std::atomic<uint64_t> sharedValue{1};
You can trigger a logical race condition by using an artificial delay before writing to sharedValue ("Race condition! Value: ..." message will be printed in reader thread). You can use std::this_thread::sleep_for for this:
void write() {
bool even = true;
for (;;) {
uint64_t value;
if (even)
value = value1;
else
value = value2;
using namespace std::chrono_literals;
std::this_thread::sleep_for(1ms);
sharedValue = value;
even = !even;
}
}

Multi Threading - Peterson's algorithm not working

Here I use Peterson's algorithm to implement mutual exclusion.
I have two very simple threads, one to increase a counter by 1, another to reduce it by 1.
const int PRODUCER = 0,CONSUMER =1;
int counter;
int flag[2];
int turn;
void *producer(void *param)
{
flag[PRODUCER]=1;
turn=CONSUMER;
while(flag[CONSUMER] && turn==CONSUMER);
counter++;
flag[PRODUCER]=0;
}
void *consumer(void *param)
{
flag[CONSUMER]=1;
turn=PRODUCER;
while(flag[PRODUCER] && turn==PRODUCER);
counter--;
flag[CONSUMER]=0;
}
They works fine when I just run them once.
But when I run them again again in a loop, strange things happen.
Here is my main function.
int main(int argc, char *argv[])
{
int case_count =0;
counter =0;
while(counter==0)
{
printf("Case: %d\n",case_count++);
pthread_t tid[2];
pthread_attr_t attr[2];
pthread_attr_init(&attr[0]);
pthread_attr_init(&attr[1]);
counter=0;
flag[0]=0;
flag[1]=0;
turn = 0;
printf ("Counter is intially set to %d\n",counter);
pthread_create(&tid[0],&attr[0],producer,NULL);
pthread_create(&tid[1],&attr[1],consumer,NULL);
pthread_join(tid[0],NULL);
pthread_join(tid[1],NULL);
printf ("counter is now %d\n",counter);
}
return 0;
}
I run the two threads again and again, until in one case the counter isn't zero.
Then, after several cases, the program will always stop! Some times after hundreds of cases, some times thousands, or event tens of thousand.
It means in one case the counter isn't zero. But why??? the two threads modify the counter in critical session, and increase and decrease it only once. Why will the counter not be zero?
Then I run this code in other computers, more strange things happen - in some computers the program seems has no problem, and the others have the same problem with me! Why?
By the way, in my computer, I run this code in VM ware's virtual computer, Ubuntu 16.04. Others' computer is also Ubuntu 16.04, but not all of them are in virtual machines. And the computer with problem contains both virtual machines and real machines.
Peterson's algorithm only works on single core processors/single CPU systems.
That's because they don't do real parallel processing. Two atomar operations never get executet at the same time there.
If you got 2 or more CPUs/CPU cores the amount of atomar operations who can be executed at the same time increase by one for each cpu(core).
This means, even if an integer assignment is atomar it can be executed multiple times at the same time in different CPUs/Cores.
In your case turn=CONSUMER/PRODUCER; is just called twice at the same time in different CPUs/cores.
Deacitvate all CPU cores but one for your program and it should work fine.
You need hardware support to implement any kind of thread-safe algorithm.
There are many reasons why your code is not working at you intended. The simplest one is that the cores have individual caches. So your program starts on say two cores. Both cache flag to be 0, 0. They both modify their own copy, so they don't see what the other core is doing.
In addition memory works in blocks, so writing flag[PRODUCER] will very likely write flag[CONSUMER] as well (because ints are 4 bytes and most of todays processors have memory blocks of 64 bytes).
Another problem would be operation reordering. Both the compiler and the processor are allowed to swap instructions. There are constraints that dictate that the single threaded execution result shouldn't change, but obviously they don't apply here.
The compiler might also figure out that you are setting turn to x and then checking if it is x, which is obviously true in a single threaded world so it can be optimized away.
This list is not exhaustive. There are many more things (some platform specific) that could happen and break your program.
So, at the very least try to use std::atomic types with strong memory ordering (memory_order_seq_cst). All your variables should be std::atomic. This gives you hardware support but it will be a lot slower.
This will still not work because most you might still have some piece of code where you read and then change. This is not atomic because some other thread might have changed the data after your read and before you changed it.

EnterSynchronizationBarrier hangs in windows 8

I tried to use new API for synchronization barriers from Windows 8, but the following simple code sometimes hangs in Windows 8:
#undef WINVER
#define WINVER 0x0603
#include "windows.h"
#include <thread>
#include <vector>
int main()
{
SYNCHRONIZATION_BARRIER barrier;
int count = 32;
InitializeSynchronizationBarrier (&barrier, count, -1);
std::vector<std::thread> threads;
for (int thr_num = 0; thr_num < count; thr_num++)
{
threads.emplace_back ([thr_num]
{
for (int i = 0; i < 100000; i++)
EnterSynchronizationBarrier (&barrier, 0);
});
}
for (auto &thr : threads)
thr.join ();
return 0;
}
Tested on Windows 8.1 64-bit on 32-core dual-Xeon E5 2630. It hangs roughly one time out of ten launches.
It seems that in windows 10 it works normally (on another machine). Is this a bug in windows 8 that got fixed, or this is not a correct usage of EnterSynchronizationBarrier (maybe you can't call it in a loop?). There're not much information about this function, have anybody even used it?
Not that it matters years later, except perhaps to show that some problems are too obscure for Stack Overflow to deliver close attention in useful time, but your usage is correct, if extreme, and the stress you have put the called function to does look to have exposed its problems with memory barriers.
In your fragment, a synchronisation barrier is prepared for 32 threads and you create 32 threads which each proceed to 100000 phases of synchronised work. All 32 reach their call number N to EnterSynchronizationBarrier before all are released on their way to their call number N+1. It should work. It likely would if your phases had any substance.
The stress is that each phase between calls is just however few instructions are involved in looping back to repeat the call. While the last thread to end phase N is in its call, it signals the others to leave, and they have a good chance of leaving (and even of reentering the function to end their phase N+1) while the thread that ends phase N is still doing its internal bookkeeping.
In this bookkeeping are two counters. One, named Barrier according to Microsoft's symbol files, is decremented as threads enter the synchronisation barrier. The other, named LeftBarrier, is incremented as they leave it. The thread that ends a phase resets Barrier from LeftBarrier (which should be the count of all participating threads) and resets LeftBarrier to 1. Or so it goes as a simplification.
The complicated reality is that the Barrier count is overloaded: its high bit signifies the change of phase. If a thread that waits at the synchronisation barrier is spinning rather than blocking on an event, then what it checks for while spinning is whether the high bit in Barrier has changed. It therefore really matters exactly how the counters get reset in the ending thread's bookkeeping. The sequence is: read LeftBarrier; write LeftBarrier as 1; write Barrier as the old LeftBarrier with the high bit toggled.
What I think happens is that without a memory barrier, the Barrier count can be written before LeftBarrier, but because Barrier has a toggled high bit, a spinning thread comes out of its spin and increments LeftBarrier from another processor before the first resets it to 1. The increment gets lost, after which all bets are off because subsequent phases will find that LeftBarrier at the end of a phase is no longer the count of participating threads.
Windows 8 and 8.1 have no memory barrier here. Windows 10 does, though I believe it's in the wrong place and that Windows Vista and Windows 7 had it correctly between the two writes. The implementation was anyway reworked completely for Version 1607 so that it now uses the WaitOnAddress functionality, much as sketched by a later Raymond Chen blog than the one cited by one of your correspondents. At the time of the cited blog, Microsoft, though possibly not Raymond, surely knew of the function's two earlier code changes regarding memory barriers.

do integer reads need to be critical section protected?

I have come across C++03 some code that takes this form:
struct Foo {
int a;
int b;
CRITICAL_SECTION cs;
}
// DoFoo::Foo foo_;
void DoFoo::Foolish()
{
if( foo_.a == 4 )
{
PerformSomeTask();
EnterCriticalSection(&foo_.cs);
foo_.b = 7;
LeaveCriticalSection(&foo_.cs);
}
}
Does the read from foo_.a need to be protected? e.g.:
void DoFoo::Foolish()
{
EnterCriticalSection(&foo_.cs);
int a = foo_.a;
LeaveCriticalSection(&foo_.cs);
if( a == 4 )
{
PerformSomeTask();
EnterCriticalSection(&foo_.cs);
foo_.b = 7;
LeaveCriticalSection(&foo_.cs);
}
}
If so, why?
Please assume the integers are 32-bit aligned. The platform is ARM.
Technically yes, but no on many platforms. First, let us assume that int is 32 bits (which is pretty common, but not nearly universal).
It is possible that the two words (16 bit parts) of a 32 bit int will be read or written to separately. On some systems, they will be read separately if the int isn't aligned properly.
Imagine a system where you can only do 32-bit aligned 32 bit reads and writes (and 16-bit aligned 16 bit reads and writes), and an int that straddles such a boundary. Initially the int is zero (ie, 0x00000000)
One thread writes 0xBAADF00D to the int, the other reads it "at the same time".
The writing thread first writes 0xBAAD to the high word of the int. The reader thread then reads the entire int (both high and low) getting 0xBAAD0000 -- which is a state that the int was never put into on purpose!
The writer thread then writes the low word 0xF00D.
As noted, on some platforms all 32 bit reads/writes are atomic, so this isn't a concern. There are other concerns, however.
Most lock/unlock code includes instructions to the compiler to prevent reordering across the lock. Without that prevention of reordering, the compiler is free to reorder things so long as it behaves "as-if" in a single threaded context it would have worked that way. So if you read a then b in code, the compiler could read b before it reads a, so long as it doesn't see an in-thread opportunity for b to be modified in that interval.
So possibly the code you are reading is using these locks to make sure that the read of the variable happens in the order written in the code.
Other issues are raised in the comments below, but I don't feel competent to address them: cache issues, and visibility.
Looking at this it seems that arm has quite relaxed memory model so you need a form of memory barrier to ensure that writes in one thread are visible when you'd expect them in another thread. So what you are doing, or else using std::atomic seems likely necessary on your platform. Unless you take this into account you can see updates out of order in different threads which would break your example.
I think you can use C++11 to ensure that integer reads are atomic, using (for example) std::atomic<int>.
The C++ standard says that there's a data race if one thread writes to a variable at the same time as another thread reads from that variable, or if two threads write to the same variable at the same time. It further says that a data race produces undefined behavior. So, formally, you must synchronize those reads and writes.
There are three separate issues when one thread reads data that was written by another thread. First, there is tearing: if writing requires more than a single bus cycle, it's possible for a thread switch to occur in the middle of the operation, and another thread could see a half-written value; there's an analogous problem if a read requires more than a single bus cycle. Second, there's visibility: each processor has its own local copy of the data that it's been working on recently, and writing to one processor's cache does not necessarily update another processor's cache. Third, there's compiler optimizations that reorder reads and writes in ways that would be okay within a single thread, but will break multi-threaded code. Thread-safe code has to deal with all three problems. That's the job of synchronization primitives: mutexes, condition variables, and atomics.
Although the integer read/write operation indeed will most likely be atomic, the compiler optimizations and processor cache will still give you problems if you don't do it properly.
To explain - the compiler will normally assume that the code is single-threaded and make many optimizations that rely on that. For example, it might change the order of instructions. Or, if it sees that the variable is only written and never read, it might optimize it away entirely.
The CPU will also cache that integer, so if one thread writes it, the other one might not get to see it until a lot later.
There are two things you can do. One is to wrap in in critical section like in your original code. The other is to mark the variable as volatile. That will signal the compiler that this variable will be accessed by multiple threads and will disable a range of optimizations, as well as placing special cache-sync instructions (aka "memory barriers") around accesses to the variable (or so I understand). Apparently this is wrong.
Added: Also, as noted by another answer, Windows has Interlocked APIs that can be used to avoid these issues for non-volatile variables.