When I run this program I get the output 10, which seems impossible to me. I'm running this on an x86_64 Core i3 under Ubuntu.
If the output is 10, then c must be 1 and d must be 0.
Also, in thread t[0] we store into c; a is already 1 at that point, since a=1 is sequenced before c=b. For c to end up as 1, b must have been set to 1 by thread t[1]. So when thread t[1] stores d, it should read a as 1 and store 1.
Can the output 10 happen with memory_order_seq_cst? I tried inserting atomic_thread_fence(seq_cst) in both threads between the 1st line (variable = 1) and the 2nd line, but it still didn't help.
Uncommenting both fences doesn't change anything.
I tried running with g++ and clang++. Both give the same result.
#include<thread>
#include<unistd.h>
#include<cstdio>
#include<atomic>
using namespace std;
atomic<int> a,b,c,d;
void foo(){
    a.store(1, memory_order_seq_cst);
    // atomic_thread_fence(memory_order_seq_cst);
    c.store(b, memory_order_seq_cst);
}
void bar(){
    b.store(1, memory_order_seq_cst);
    // atomic_thread_fence(memory_order_seq_cst);
    d.store(a, memory_order_seq_cst);
}
int main(){
    thread t[2];
    t[0] = thread(foo); t[1] = thread(bar);
    t[0].join(); t[1].join();
    printf("%d%d\n", c.load(memory_order_seq_cst), d.load(memory_order_seq_cst));
}
bash$ while [ true ]; do ./a.out | grep "10" ; done
10
10
10
10
10 (c=1, d=0) is easily explained: bar happened to run first, and finished before foo read b.
Quirks of inter-core communication to get threads started on different cores mean it's easily possible for this to happen even though thread(foo) ran first in the main thread. For example, maybe an interrupt arrived at the core the OS chose for foo, delaying it from actually getting into that code (see footnote 1).
Remember that seq_cst only guarantees that some total order exists for all seq_cst operations which is compatible with the sequenced-before order within each thread. (And any other happens-before relationship established by other factors.) So the following order of atomic operations is possible without even breaking out the a.load (see footnote 2) in bar separately from the d.store of the resulting int temporary.
b.store(1,memory_order_seq_cst); // bar1. b=1
d.store(a,memory_order_seq_cst); // bar2. a.load reads 0, d=0
a.store(1,memory_order_seq_cst); // foo1
c.store(b,memory_order_seq_cst); // foo2. b.load reads 1, c=1
// final: c=1, d=0
atomic_thread_fence(seq_cst) has no impact anywhere because all your operations are already seq_cst. A fence basically just stops reordering of this thread's operations; it doesn't wait for or sync with fences in other threads.
(Only a load that sees a value stored by another thread can create synchronization. But such a load doesn't wait for the other store; it has no way of knowing there is another store. If you want to keep loading until you see the value you expect, you have to write a spin-wait loop.)
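For illustration only (this is not part of the question's code; it just reuses the question's globals): if bar actually needed to observe foo's a = 1 before storing d, it would have to spin-wait, something like:

void bar_waiting(){
    b.store(1, memory_order_seq_cst);
    while (a.load(memory_order_seq_cst) == 0) {
        // keep re-loading until foo's a.store(1) becomes visible to this thread
    }
    d.store(a, memory_order_seq_cst);   // guaranteed to store 1: nothing ever writes 0 back to a
}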
Footnote 1:
Since all your atomic vars are probably in the same cache line, even if execution did reach the top of foo and bar at the same time on two different cores, false-sharing is likely going to let both operations from one thread happen while the other core is still waiting to get exclusive ownership. Although seq_cst stores are slow enough (on x86 at least) that hardware fairness stuff might relinquish exclusive ownership after committing the first store of 1. Anyway, lots of ways for both operations in one thread to happen before the other thread and get 10 or 01. Even possible to get 11 if we get b=1 then a=1 before either load. Using seq_cst does stop the hardware from doing the load early (before the store is globally visible), so it's very possible.
Footnote 2: The lvalue-to-rvalue evaluation of bare a uses the overloaded (int) conversion which is equivalent to a.load(seq_cst). The operations from foo could happen between that load and the d.store that gets a temporary value from it. d.store(a) is not an atomic copy; it's equivalent to int tmp = a; d.store(tmp);. That isn't necessary to explain your observations.
The printf statements are unsynchronized, so an output of 10 can be just a reordered 01.
01 happens when the two functions run serially before the printf.
Related
volatile bool b;
Thread1: //only reads b
void f1() {
    while (1) {
        if (b) { /* do something */ }
        else { /* do something else */ }
    }
}
Thread2:
//only sets b to true if certain condition met
// updated by thread2
void f2() {
    while (1) {
        // some local condition is evaluated into local_cond
        if (!b && (local_cond == true)) b = true;
        // some other work
    }
}
Thread3:
//only sets b to false when it gets a message on a socket its listening to
void f3() {
    while (1) {
        // select() on the listening socket
        if (expected_message_came) b = false;   // pseudocode condition
        // do some other work
    }
}
If thread2 updates b first at time t and later thread3 updates b at time t+5:
will thread1 see the latest value "in time" whenever it is reading b?
for example: reads from t+delta to t+5+delta should read true and
reads after t+5+delta should read false.
where delta is the time it takes for the store to b to reach memory after thread 2 or 3 updates it.
The effect of the volatile keyword is principally two things (I'm avoiding strictly formal wording here):
1) Its accesses can't be cached or combined. (Update: to be clear, this means caching in registers or other compiler-chosen locations, not the CPU's hardware caches.) For example, the following code:
x = 1;
x = 2;
will, for a volatile x, never be combined into a single x = 2, whatever optimization level is used; but if x is not volatile, even low optimization levels will likely collapse it into a single write. The same goes for reads: each read will access the variable's value without any attempt to cache it.
2) All volatile operations are emitted at the machine-instruction level in the same order relative to one another (to emphasize: only relative to other volatile operations) as they appear in the source code.
But this is not true for ordering between non-volatile and volatile accesses. For the following code:
int *x;
volatile int *vy;
void foo()
{
    *x = 1;
    *vy = 101;
    *x = 2;
    *vy = 102;
}
gcc (9.4) with -O2 and clang (10.0) with -O produce something similar to:
movq x(%rip), %rax
movq vy(%rip), %rcx
movl $101, (%rcx)
movl $2, (%rax)
movl $102, (%rcx)
retq
so one access to x is already gone, despite it appearing between two volatile accesses. If you need the first x = 1 to happen before the first write to vy, put an explicit barrier there (since C11, atomic_signal_fence is the platform-independent means for this).
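As a sketch of that suggestion (same x and vy as above; in C you would use atomic_signal_fence from <stdatomic.h>, in C++ std::atomic_signal_fence from <atomic>):

#include <atomic>
int *x;
volatile int *vy;
void foo()
{
    *x = 1;
    std::atomic_signal_fence(std::memory_order_seq_cst); // compiler-only barrier: *x = 1 can no longer be merged with or moved past this point
    *vy = 101;
    *x = 2;
    *vy = 102;
}

Note this only constrains the compiler; it emits no CPU fence instruction.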
That was the general rule, but without regard to multithreading. What happens here with multithreading?
Well, you said thread 2 writes true to b, so this is a write of the value 1 to a single-byte location. It is an ordinary write, without any memory-ordering requirements. What volatile gives you is that the compiler won't optimize it away. But what about the processor?
If this were a modern abstract processor, or one with relaxed rules like ARM, I'd say nothing prevents it from postponing the real write for an indefinite time. (To clarify, "write" here means making the operation visible to the RAM-and-all-caches conglomerate.) It's entirely up to the processor. Processors are designed to drain their stockpile of pending writes as fast as possible, but you can't know what affects the real delay: for example, the processor could "decide" to fill the instruction cache with the next few lines first, or flush other queued writes... lots of possibilities. The only thing we know is that it makes a "best effort" to flush all queued operations, to avoid getting buried under previous results. That's entirely natural and nothing more.
With x86, there is an additional factor. Nearly every memory write (and, I guess, this one as well) is a "releasing" write on x86, so all previous reads and writes must be completed before this write. But the key point is that the operations ordered to complete are the ones before this write. So when you write true to the volatile b, you can be sure all previous operations have become visible to the other participants... but this write itself could still be postponed for a while... how long? Nanoseconds? Microseconds? Any later write to memory will flush, and so publish, this write to b... do you have any writes in the loop iteration of thread 2?
The same applies to thread 3. You can't be sure that b = false will be published to the other CPUs when you need it. The delay is unpredictable. Unless this is a realtime-aware hardware system, nothing is guaranteed within any bounded time: the ISA rules and barriers provide ordering, but not exact times. And x86 is definitely not such a realtime system.
All of this means you also need an explicit barrier after the write, one that affects not only the compiler but the CPU as well: a barrier between the previous write and the following reads or writes. Among the C/C++ facilities, a full barrier satisfies this, so you would add std::atomic_thread_fence(std::memory_order_seq_cst) or use an atomic variable (instead of a plain volatile one) with that memory order for the write.
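A minimal sketch of the two options just mentioned (the variable names here are illustrative, not from the question):

#include <atomic>

volatile bool b1;                    // option 1: keep the volatile, add a full fence after the write
std::atomic<bool> b2{false};         // option 2: replace it with an atomic

void publish_option1() {
    b1 = true;
    std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier: constrains the CPU as well as the compiler
}

void publish_option2() {
    b2.store(true, std::memory_order_seq_cst);            // on x86 this compiles to a store plus a full barrier
}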
And all this still won't give you exact timings like the ones you described ("t" and "t+5"), because the visible "timestamps" of the same operation can differ for different CPUs! (Well, this resembles Einstein's relativity a bit.) All you can say in this situation is that something has been written to memory, and that typically (though not always) the inter-CPU order is what you expected (but an ordering violation will punish you).
But I can't grasp the general idea of what you want to implement with this flag b. What do you want from it, what state should it reflect? Try going back to the higher-level task and reformulating it. Is this (I'm just guessing here) a green light to do something, which can be cancelled by an external order? If so, an internal permission ("we are ready") from thread 2 should not override that cancellation. This can be done using different approaches, such as:
1) Just separate flags and a mutex/spinlock around setting them. Easy, but a bit costly (or even substantially costly, I don't know your environment).
2) An atomically modified analog. For example, you can use a bit-flag variable which is modified using compare-and-swap: assign bit 0 to "ready" and bit 1 to "cancelled". In C, atomic_compare_exchange_strong is what you'll need here on x86 (and on most other ISAs). And volatile is no longer needed here if you stick with memory_order_seq_cst.
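A rough sketch of that second approach in C++ (the names and bit assignments are illustrative, not prescribed by the answer):

#include <atomic>
#include <cstdint>

constexpr uint32_t READY     = 1u << 0;
constexpr uint32_t CANCELLED = 1u << 1;
std::atomic<uint32_t> state{0};

void set_ready() {                     // thread 2: announces readiness, never clears a cancellation
    uint32_t cur = state.load();
    while (!state.compare_exchange_strong(cur, cur | READY)) {
        // cur has been refreshed with the current value; retry
    }
}

void cancel() {                        // thread 3: cancellation overrides readiness
    state.fetch_or(CANCELLED);
}

bool green_light() {                   // thread 1: act only if ready and not cancelled
    uint32_t cur = state.load();
    return (cur & READY) && !(cur & CANCELLED);
}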
Will thread1 see the latest value "in time" whenever it is reading b?
Yes, the volatile keyword denotes that the variable can be modified outside the current thread or by hardware without the compiler being aware of it, so every access (both read and write) made through an lvalue expression of volatile-qualified type is considered an observable side effect for the purpose of optimization and is evaluated strictly according to the rules of the abstract machine (that is, all writes are complete at some time before the next sequence point). This means that within a single thread of execution, a volatile access cannot be optimized out or reordered relative to another visible side effect that is separated from it by a sequence point.
Unfortunately, the volatile keyword does not make the operations thread-safe, so they have to be handled with care; it is recommended to use std::atomic for this, except perhaps in an embedded or bare-metal scenario.
Also, the whole struct should be atomic: struct X { int a; volatile bool b; };.
Say I have a system with 2 cores. The first core runs thread 2, the second core runs thread 3.
reads from t+delta to t+5+delta should read true and reads after t+5+delta should read false.
The problem is that thread 1 may not read until t + 10000000, when the kernel decides one of the threads has run long enough and schedules a different one. So it is likely that thread 1 will not see the change in time a lot of the time.
Note: this ignores all the additional problems of cache coherence and observability. If the thread isn't even running, all of that becomes irrelevant.
This question is based on Can't relaxed atomic fetch_add reorder with later loads on x86, like store can?
I agree with the answer given there.
On x86, 00 will never occur because a.fetch_add has a lock prefix / full barrier and loads can't reorder above the fetch_add, but on other architectures like ARM/MIPS it can print 00. I have two follow-up questions about the store buffer on x86 and ARM.
I never get 11 on my PC (Core i3, x86_64). Is 11 a valid output on x86 in ISO C++, or am I missing something? @Daniel Langr demonstrated that 11 is a valid output on x86.
Now, x86_64 has the advantage of fetch_add acting as a full barrier.
For arm64, the output can sometimes be 00 due to CPU instruction reordering.
For arm64 or some other arch, can the output be 00 without any reordering? My question is based on this: the value from foo's a.fetch_add(1) sits in the store buffer and is not yet visible to bar's a.load(), and bar's b.fetch_add(1) is likewise not yet visible to foo's b.load(), hence we get 00 without reordering. Can this happen under ISO C++ on different archs?
// g++ -O2 -pthread axbx.cpp ; while [ true ]; do ./a.out | grep "00" ; done
#include<cstdio>
#include<thread>
#include<atomic>
using namespace std;
atomic<int> a,b;
int reta,retb;
void foo(){
    a.fetch_add(1, memory_order_relaxed); // add to a is stored in store buffer of cpu0
    //a.store(1, memory_order_relaxed);
    retb = b.load(memory_order_relaxed);
}
void bar(){
    b.fetch_add(1, memory_order_relaxed); // add to b is stored in store buffer of cpu1
    //b.store(1, memory_order_relaxed);
    reta = a.load(memory_order_relaxed);
}
int main(){
    thread t[2]{ thread(foo), thread(bar) };
    t[0].join(); t[1].join();
    printf("%d%d\n", reta, retb);
    return 0;
}
Yes, ISO C++ allows this; as Daniel pointed out, one easy way is to put some slow stuff after the RMWs, before the loads, so they don't execute until both threads have had a chance to increment. This should be obvious because it doesn't require any run-time reordering to happen, just a simple interleaving of program-order. (So ISO C++ allows 11 even with seq_cst.) i.e. you could have exercised this by single-stepping each thread separately.
Are you wondering about how to create a practical demonstration on x86 without delay loops?
Try putting your atomic vars in separate cache lines so two different cores can be writing them in parallel.
alignas(64) std::atomic<int> a, b; // the alignas applies to each separately
With them in the same cache line, like probably happens by default, the core that won ownership of the cache line they're both in is going to be able to execute the load of the other var as soon as the full-barrier part of the increment is done. Completing the RMW means that cache line is already hot in this core's L1d cache. (A core can only modify a cache line after gaining exclusive ownership of it via MESI, which will also make it valid for reads.)
So both operations by one thread are extremely likely to happen before either operation by the other thread. (In the x86 asm, each operation has the same asm as its seq_cst equivalent would have had, so we can usefully talk about a global order of operations without losing anything.)
Probably the only thing that would stop this from happening is an interrupt arriving at just the right moment, between the RMW and the load.
You also asked a separate question:
can the output be 00 without any reordering?
Clearly no. No interleaving of program-order can put both loads before either increment, so either run-time or compile-time reordering is necessary to create the 00 effect.
a.fetch_add(1,memory_order_relaxed); // foo1
retb=b.load(memory_order_relaxed); // foo2
b.fetch_add(1,memory_order_relaxed); // bar1
reta=a.load(memory_order_relaxed); // bar2
Mix those however you want without putting foo2 before foo1 or bar2 before bar1.
i.e. if you single-stepped each thread separately, you could never see 00. Of course the whole point of mo_relaxed is that it can reorder. Specifying "without reordering" is the same as saying "with seq_cst".
The effects of a store buffer are a kind of reordering, specifically StoreLoad reordering. mo_seq_cst prevents even that, which is part of the point of seq_cst and what makes it so expensive, especially for pure-store operations.
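For illustration only (not part of the quoted answer), here is the seq_cst variant of the question's two functions under which 00 becomes impossible; only the memory orders change:

void foo(){
    a.fetch_add(1, memory_order_seq_cst);  // all four operations fall into one total order
    retb = b.load(memory_order_seq_cst);
}
void bar(){
    b.fetch_add(1, memory_order_seq_cst);
    reta = a.load(memory_order_seq_cst);
}

Whichever load comes last in that total order must observe the other thread's increment, so at most one of reta/retb can be 0.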
I don't know why my code isn't thread-safe, as it outputs some inconsistent results.
value 48
value 49
value 50
value 54
value 51
value 52
value 53
My understanding of an atomic object is that it prevents its intermediate state from being exposed, so it should solve the problem of one thread reading it while another thread is writing to it.
I used to think I could use std::atomic, without a mutex, to solve the multi-threading counter-increment problem, but it doesn't seem to be the case.
I probably misunderstood what an atomic object is. Can someone explain?
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

void
inc(std::atomic<int>& a)
{
    while (true) {
        a = a + 1;
        printf("value %d\n", a.load());
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
}

int
main()
{
    std::atomic<int> a(0);
    std::thread t1(inc, std::ref(a));
    std::thread t2(inc, std::ref(a));
    std::thread t3(inc, std::ref(a));
    std::thread t4(inc, std::ref(a));
    std::thread t5(inc, std::ref(a));
    std::thread t6(inc, std::ref(a));
    t1.join();
    t2.join();
    t3.join();
    t4.join();
    t5.join();
    t6.join();
    return 0;
}
I used to think I could use std::atomic, without a mutex, to solve the multi-threading counter-increment problem, but it doesn't seem to be the case.
You can, just not the way you have coded it. You have to think about where the atomic accesses occur. Consider this line of code …
a = a + 1;
1. First the value of a is fetched atomically. Let's say the value fetched is 50.
2. We add one to that value, getting 51.
3. Finally we atomically store that value into a using the = operator.
4. a ends up being 51.
5. We atomically load the value of a by calling a.load().
6. We print the value we just loaded by calling printf().
So far so good. But between steps 1 and 3 some other threads may have changed the value of a - for example to the value 54. So, when step 3 stores 51 into a it overwrites the value 54 giving you the output you see.
As @Sopel and @Shawn suggest in the comments, you can atomically increment the value in a using one of the appropriate member functions (like fetch_add) or operator overloads (like operator++ or operator+=). See the std::atomic documentation for details.
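A minimal sketch of that fix for the question's inc function (same includes as the question; the key change is capturing the increment's result instead of reloading a):

void inc(std::atomic<int>& a)
{
    while (true) {
        int seen = ++a;               // one atomic read-modify-write: no increment can be lost
        printf("value %d\n", seen);   // print the value this thread produced, not a fresh reload
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
}

The printed lines can still appear out of order across threads; only the counting itself becomes correct.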
Update
I added steps 5 and 6 above. Those steps can also lead to results that may not look correct.
Between the store at step 3 and the call to a.load() at step 5, other threads can modify the contents of a. After our thread stores 51 in a at step 3, it may find that a.load() returns some different number at step 5. Thus the thread that set a to the value 51 may not pass the value 51 to printf().
Another source of problems is that nothing coordinates the execution of steps 5 and 6 between two threads. So, for example, imagine two threads X and Y running on a single processor. One possible execution order might be this …
Thread X executes steps 1 through 5 above incrementing a from 50 to 51 and getting the value 51 back from a.load()
Thread Y executes steps 1 through 5 above incrementing a from 51 to 52 and getting the value 52 back from a.load()
Thread Y executes printf() sending 52 to the console
Thread X executes printf() sending 51 to the console
We've now printed 52 on the console, followed by 51.
Finally, there's another problem lurking at step 6, because printf() doesn't make any promises about what happens if two threads call printf() at the same time (at least I don't think it does).
On a multiprocessor system threads X and Y above might call printf() at exactly the same moment (or within a few ticks of exactly the same moment) on two different processors. We can't make any prediction about which printf() output will appear first on the console.
Note: The documentation for printf mentions a lock introduced in C++17 "… used to prevent data races when multiple threads read, write, position, or query the position of a stream." In the case of two threads simultaneously contending for that lock we still can't tell which one will win.
Besides the increment of a being done non-atomically, the fetch of the value to display after the increment is non-atomic with respect to the increment. It is possible that one of the other threads increments a after the current thread has incremented it but before the fetch of the value to display. This would possibly result in the same value being shown twice, with the previous value skipped.
Another issue here is that the threads do not necessarily run in the order they have been created. Thread 7 could execute its output before threads 4, 5, and 6, but after all four threads have incremented a. Since the thread that did the last increment displays its output earlier, you end up with the output not being sequential. This is more likely to happen on a system with fewer than six hardware threads available to run on.
Adding a small sleep between the various thread creations (e.g., sleep_for(10)) would make this less likely to occur, but would still not eliminate the possibility. The only sure way to keep the output ordered is to use some sort of exclusion (like a mutex) to ensure only one thread at a time has access to the increment and output code, treating the increment and the output as a single transaction that must run to completion before another thread tries to do an increment.
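A sketch of that mutex approach (illustrative; the lock covers both the increment and the print so they behave as one transaction):

#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex m;
int counter = 0;                      // a plain int is fine once every access happens under the mutex

void inc_and_print()
{
    while (true) {
        {
            std::lock_guard<std::mutex> lock(m);
            ++counter;
            printf("value %d\n", counter);   // increment and print can no longer interleave with other threads
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
}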
The other answers point out the non-atomic increment and various problems. I mostly want to point out some interesting practical details about exactly what we see when running this code on a real system. (x86-64 Arch Linux, gcc9.1 -O3, i7-6700k 4c8t Skylake).
It can be useful to understand why certain bugs or design choices lead to certain behaviours, for troubleshooting / debugging.
Use int tmp = ++a; to capture the fetch_add result in a local variable instead of reloading it from the shared variable. (And as 1202ProgramAlarm says, you might want to treat the whole increment and print as an atomic transaction if you insist on having your counts printed in order as well as being done properly.)
Or you might want to have each thread record the values it saw in a private data structure to be printed later, instead of also serializing threads with printf during the increments. (In practice, having them all try to increment the same atomic variable will serialize them anyway, waiting for access to the cache line; ++a goes in order, so you can tell from the modification order which thread went in which order.)
Fun fact: a.store(1 + a.load(std::memory_order_relaxed), std::memory_order_release) is what you might do for a variable that was only written by 1 thread, but read by multiple threads. You don't need an atomic RMW because no other thread ever modifies it. You just need a thread-safe way to publish updates. (Or better, in a loop keep a local counter and just .store() it without loading from the shared variable.)
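A sketch of that single-writer pattern (illustrative names; one thread owns the counter and merely publishes it, readers only load):

#include <atomic>
#include <cstdio>

std::atomic<int> progress{0};

void writer() {                        // the only thread that ever writes progress
    for (int local = 1; local <= 1000000; ++local) {
        // ... do one unit of work ...
        progress.store(local, std::memory_order_release);  // publish; no RMW needed since nobody else writes
    }
}

void reader() {
    int seen = progress.load(std::memory_order_acquire);   // safe to call concurrently from other threads
    printf("progress so far: %d\n", seen);
}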
If you used the default a = ... for a sequentially-consistent store, you might as well have done an atomic RMW on x86: one good way to compile a seq_cst store is with an atomic xchg, and mov + mfence is as expensive (or more).
What's interesting is that despite the massive problems with your code, no counts were lost or stepped on (no duplicate counts), merely printing reordered. So in practice the danger wasn't encountered because of other effects going on.
I tried it on my own machine and did lose some counts. But after removing the sleep, I just got reordering. (I copy-pasted about 1000 lines of the output into a file, and sort -u to uniquify the output didn't change the line count. It did move some late prints around though; presumably one thread got stalled for a while.) My testing didn't check for the possibility of lost counts, skipped by not saving the value being stored into a, and instead reloading it. I'm not sure there's a plausible way for that to happen here without multiple threads reading the same count, which would be detected.
Store + reload, even a seq-cst store which has to flush the store buffer before it can reload, is very fast compared to printf making a write() system call. (The format string includes a newline and I didn't redirect output to a file so stdout is line-buffered and can't just append the string to a buffer.)
(write() system calls on the same file descriptor are serializing in POSIX: write(2) is atomic. Also, printf(3) itself is thread-safe on GNU/Linux, as required by C++17, and probably by POSIX long before that.)
Stdio locking in printf happens to be enough serialization in almost all cases: the thread that just unlocked stdout and left printf can do the atomic increment and then try to take the stdout lock again.
The other threads were all blocked trying to take the lock on stdout. One (other?) thread can wake up and take the lock on stdout, but for its increment to race with the other thread it would have to enter and leave printf and load a the first time before that other thread commits its a = ... seq-cst store.
This does not mean it's actually safe
Just that testing this specific version of the program (at least on x86) doesn't easily reveal the lack of safety. Interrupts or scheduling variations, including competition from other things running on the same machine, certainly could block a thread at just the wrong time.
My desktop has 8 logical cores so there were enough for every thread to get one, not having to get descheduled. (Although normally that would tend to happen on I/O or when waiting on a lock anyway).
With the sleep there, it is not unlikely for multiple threads to wake up at nearly the same time and race with each other in practice on real x86 hardware. It's so long that timer granularity becomes a factor, I think. Or something like that.
Redirecting output to a file
With stdout open on a non-TTY file, it's full-buffered instead of line-buffered, and doesn't always make a system call while holding the stdout lock.
(I got a 17MiB file in /tmp from hitting control-C a fraction of a second after running ./a.out > output.)
This makes it fast enough for threads to actually race with each other in practice, showing the expected bugs of duplicate values. (A thread reads a but loses ownership of the cache line before it stores (tmp)+1, resulting in two or more threads doing the same increment. And/or multiple threads reading the same value when they reload a after flushing their store buffer.)
1228589 unique lines (sort -u | wc) out of 1291035 total lines, so ~5% of the output lines were duplicates.
I didn't check if it was usually one value duplicated multiple times or if it was usually only one duplicate. Or how far backward the value ever jumped. If a thread happened to be stalled by an interrupt handler after loading but before storing val+1, it could be quite far. Or if it actually slept or blocked for some reason, it could rewind indefinitely far.
I've been studying 'C++ Concurrency in Action' for some time, and I have a problem understanding the following example of code (Listing 5.2):
#include <vector>
#include <atomic>
#include <iostream>
#include <thread>
#include <chrono>
std::vector<int> data;
std::atomic<bool> data_ready(false);
void reader_thread()
{
    while (!data_ready.load())
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    std::cout << "The answer=" << data[0] << "\n";
}
void writer_thread()
{
    data.push_back(42);   // write of data
    data_ready = true;    // write to data_ready flag
}
The book explains:
(...) The write of the data happens-before the write to the data_ready flag (...)
My concern is that this sentence does not account for out-of-order execution. From my understanding, out-of-order execution may happen when two instructions do not have dependent operands. Taking this into account:
data_ready=true
does not need anything from
data.push_back(42)
to be executed. As a result of that it is not guaranteed that:
The write of the data happens-before the write to the data_ready flag
Is my understanding correct, or is there something about out-of-order execution that I don't understand, causing me to misread the example?
EDIT
Thank you for the answers, they were helpful. My misunderstanding was a result of not knowing that atomic types not only prevent a variable from being partially changed, but also act as memory barriers.
For example, the following code may be reordered in many ways by either the compiler or the processor:
d=0;
b=5;
a=10;
c=1;
Resulting in the following order (one of many possibilities):
b=5;
a=10;
c=1;
d=0;
This is not a problem in single-threaded code, since none of the expressions has operands that depend on the others, but in a multithreaded application it may result in undefined behaviour. For example, consider the following code (initial values: x=0 and y=0):
Thread 1: Thread 2:
x=10; while(y!=15);
y=15; assert(x==10);
Without reordering of the code by the compiler or reordering of execution by the processor, we could say: "Since the assignment y=15 always happens after the assignment x=10, and the assert happens after the while loop, the assert will never fail." But that's not true. The real execution order may be as below (one of many possible combinations):
Thread 1: Thread 2:
x=10; (4) while(y!=15); (3)
y=15; (1) assert(x==10); (2)
By default, an atomic variable ensures sequentially consistent ordering. If y in the example above were atomic with the default memory_order_seq_cst parameter, the following statements would hold:
- whatever happens before the y=15 store in thread 1 (namely x=10) is also seen by thread 2 as having happened before it.
- whatever happens after the while(y!=15) loop in thread 2 is also seen by thread 1 as happening after it.
As a result, the assert will never fail.
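A sketch of the corrected example (y made atomic; everything else as above):

#include <atomic>
#include <cassert>

int x = 0;
std::atomic<int> y{0};

void thread1() {
    x = 10;        // ordinary store, sequenced before the atomic store below
    y.store(15);   // default memory_order_seq_cst store publishes it
}

void thread2() {
    while (y.load() != 15) { }   // default memory_order_seq_cst load
    assert(x == 10);             // once 15 is seen, x == 10 is guaranteed to be visible
}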
Some sources that may help with understanding:
Memory model synchronization modes - GCC
CppCon 2015: Michael Wong “C++11/14/17 atomics and memory model...”
Memory barriers in C
I understand your concerns, but the code from the book is fine. Every operation on atomics is memory_order_seq_cst by default, which means that everything that happened before the write in one thread happens-before the read of that value in the others. You can imagine atomic operations with this std::memory_order like this:
std::atomic<bool> a;
//equivalent of a = true
a.assign_and_make_changes_from_thread_visible(true);
//equivalent of a.load()
a.get_value_and_changes_from_threads();
Effective Modern C++, Item 40, says: "std::atomics imposes restrictions on how code can be reordered, and one such restriction is that no code that, in the source code, precedes a write of a std::atomic variable may take place afterwards." Note that this is true when using sequential consistency, which is a fair assumption here.
I have a single-core CPU (ARM Cortex-M3, 32-bit) with two threads. Assume the following situation:
// Thread 1:
int16_t a = 1;
double b = 1.0;
// Do some other fancy stuff including starting Thread 2
for (;;) {std::cout << a << "," <<b;}
// Thread 2:
a = 2;
b = 2.0;
I can handle the following outputs:
1,1
1,2
2,1
2,2
Can I be certain that the output will always be one of those (1/2) without using mutex or other locking mechanisms? And more specifically, is this compiler dependent? And is the behavior different for int16 and double?
It depends on the CPU, mostly, though in theory anything involving multiple threads pre-C++11 is at best implementation-defined and at worst undefined behavior, so the compiler might do just about anything.
If you can ignore crazy compilers that do silly things, and assume that the compiler will use the CPU's facilities in a reasonable way, it depends mostly on what the CPU supports.
Cortex-M3 is a 32-bit CPU with a 32-bit bus and no FPU. So reads and writes of 32-bit and smaller values will generally be atomic. double, however, is 64 bits, so any read/write of a double will involve two instructions and be non-atomic. Thus if one thread reads while the other is writing you might get half from one value and half from the other.
Now in your specific example, the values 1.0 and 2.0 are both 0 for their lower half, so the 'mix' would be innocuous, but other values will not have that behavior.
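If C++11 atomics are available on your toolchain, a sketch of the portable route is to make the shared variables atomic and check whether the double can actually be handled lock-free on this core (on a Cortex-M3 it typically can't, so the library falls back to an internal lock, often via libatomic):

#include <atomic>
#include <cstdint>

std::atomic<int16_t> a{1};
std::atomic<double>  b{1.0};

void check() {
    bool small_ok = a.is_lock_free();   // expected true: a 16-bit value fits the 32-bit bus
    bool wide_ok  = b.is_lock_free();   // expected false on a Cortex-M3: 64 bits needs two accesses
    (void)small_ok; (void)wide_ok;
}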
The evaluation order of the operands in the expression before the ; is not guaranteed to be left to right, and even if it were, the accesses are not atomic: you can read a torn value if one thread reads while another writes (wide accesses can and do take multiple cycles to perform, and a context switch can interrupt them).
On ARM in particular, reads and writes go into a queue on the CPU where they are free to be reordered (except past memory barriers); even on a CPU that doesn't reorder memory, the compiler is still free to reorder them. There is nothing stopping your assignments and reads from being moved forwards or backwards, so you cannot guarantee the state of any of the values or their ordering.