C++ atomics reading stale value

I'm reading the C++ Concurrency in Action book and I'm having trouble understanding the visibility of writes to atomic variables.
Let's say we have
std::atomic<int> x = 0;
and we read/write it with sequentially consistent ordering:
1. ++x;
   // <-- thread 2 is here (has already executed line 1)
2. if (x == 1) {
   // <-- thread 1 is here (just read x == 1)
}
Suppose two threads execute the code above.
Is it possible that thread 1 arrives at line 2 and reads x == 1 even though thread 2 has already executed line 1?
In other words, does the sequentially consistent ++x of thread 2 get propagated to thread 1 immediately, or can thread 1 read a stale value x == 1?
I think this situation is possible with relaxed or acquire/release ordering, but what about sequentially consistent ordering?

If you're thinking that multiple atomic operations are somehow safely grouped, you're wrong. They'll always occur in order within that thread, and they'll be visible in that order, but there is no guarantee that two separate operations will occur in one thread before either occurs in the other.
So for your specific question "Is it possible that thread 1 arrives at line 2. and reads x == 1, after thread 2 already executed line 1.?", the answer is yes, thread 1 could reach the x == 1 test after thread 2 has incremented x as well, so x would already be 2 and neither thread would see x == 1 as true.
The simplest way to think about this is to imagine a single processor system, and consider what happens if the running thread is switched out at any time aside from the middle of a single atomic operation.
So in this case, the operations (inc1 and test1 for thread 1 and inc2 and test2 for thread 2) could occur in any of the following orders:
inc1 test1 inc2 test2
inc1 inc2 test1 test2
inc1 inc2 test2 test1
inc2 inc1 test1 test2
inc2 inc1 test2 test1
inc2 test2 inc1 test1
As you can see, there is no possibility of either test occurring before either increment. Nor can both tests pass: the only way a test passes is if the increment on its own thread has occurred but the increment on the other thread has not. But there is no guarantee that any test passes; both increments could precede both tests, in which case both tests compare against the value 2 and neither passes. The race window is narrow, so most of the time you'd probably see exactly one test pass, but it's entirely possible to get unlucky and have neither pass.
If you want to make this work reliably, you need to make sure you both modify and test in a single operation, so exactly one thread will see the value as being 1:
if (++x == 1) { // The first thread to get here will do the stuff
    // Do stuff
}
In this case, the increment and read are a single atomic operation, so the first thread to get to that line (which might be thread 1 or thread 2, no guarantees) will perform the first increment with ++x atomically returning the new value which is tested. Two threads can't both see x become 1, because we kept both increment and test as one operation.
That said, if you're relying on the contents of that if being completed before any thread executes the code after the if, that won't work. The first thread could enter the if, while the second thread arrives nanoseconds later, realizes it wasn't the first to get there, skips the if, and immediately begins executing the code after it even though the first thread hasn't finished. Simple use of atomics like this is not suited to the "run only once" scenario people often write this code for, where the run-once code must have completed before any dependent code executes.
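If the actual requirement is "run some initialization exactly once, and make sure it has finished before any thread moves past it", the standard tool for that is std::call_once. A minimal sketch (not taken from the question's code; the names are illustrative):

#include <mutex>

std::once_flag init_flag;

void init() {
    // code that must run exactly once
}

void worker() {
    std::call_once(init_flag, init); // the first caller runs init(); any other caller
                                     // blocks here until init() has returned
    // code here can safely rely on init() having completed
}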

Let's simplify your question.
When two threads execute func():
std::atomic<int> x = 0;

void func()
{
    ++x;
    std::cout << x;
}
is the following result possible?
11
And the answer is NO, "11" is impossible.
Note that the increment and the print are separate atomic operations, so "12", "21", or even "22" are possible, but both threads can never print 1.
Sequential consistency on an atomic variable works the way you want in this simple case.
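If you also want to rule out "22", have each thread print the value returned by its own increment instead of re-reading x. A minimal sketch (same setup as above):

#include <atomic>
#include <iostream>

std::atomic<int> x{0};

void func()
{
    int v = ++x;    // atomic read-modify-write; returns the new value
    std::cout << v; // one thread prints 1, the other prints 2
}

Now the only possible outputs are "12" and "21", because the two increments return distinct values.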


How does atomic synchronization work on a single thread when it gets migrated to another core

I'm asking this question with pseudo-code, targeting both Rust and C++, since the memory-model concepts are the same in both.
SomeFunc() {
    x = counter.load(Ordering::Relaxed)   // #1
    counter.store(x+1, Ordering::Relaxed) // #2
    y = counter.load(Ordering::Relaxed)   // #3
}
Question: imagine SomeFunc is being executed by a thread, and between #2 and #3 the thread gets interrupted, so #3 ends up executing on a different core. In that case, does the counter variable get synchronized with the last value updated on core 1 when the thread runs on core 2 (there is no explicit release/acquire)? I suppose the entire cache line plus the thread-local storage gets shelved when the thread briefly goes to sleep, and loaded again when it comes back running on a different core?
First of all, it should be noted that atomic instructions add synchronization, and do not remove it.
Would you expect:
unsigned func(unsigned* counter) {
    auto x = *counter;
    *counter = x + 1;
    auto y = *counter;
    return y;
}
to return anything other than the original value of *counter plus 1?
Yet, similarly, the thread could be moved between cores in-between two statements!
The above code executes fine even when the core is moved because the OS takes care during the switch to appropriately synchronize between cores to preserve user-space program order.
So, what happens when using atomics on a single thread?
Well, you add a bit of processing overhead -- more synchronization -- and the OS still takes care during the switch to appropriately synchronize.
Hence the effect is strictly the same.
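For comparison, here is that function written with C++ atomics and relaxed ordering (a sketch; the question's pseudo-code leaves the exact types open). On a single thread it still returns the original value plus 1, whether or not the thread migrates between cores in the middle, because the three operations are sequenced within the thread and the OS synchronizes across the switch:

#include <atomic>

unsigned func(std::atomic<unsigned>& counter) {
    auto x = counter.load(std::memory_order_relaxed);  // #1
    counter.store(x + 1, std::memory_order_relaxed);   // #2
    auto y = counter.load(std::memory_order_relaxed);  // #3
    return y; // x + 1, as long as no other thread touches counter concurrently
}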

output 10 with memory_order_seq_cst

When I run this program I get the output 10, which seems impossible to me. I'm running this on an x86_64 Core i3 under Ubuntu.
If the output is 10, then 1 must have come from either c or d.
Also, in thread t[0] we end up storing 1 into c. By then a is 1, since a = 1 occurs before c = b. c is equal to b, which was set to 1 by thread t[1]. So when thread t[1] stores d, it should read a as 1.
Can the output 10 happen with memory_order_seq_cst? I tried inserting an atomic_thread_fence(seq_cst) in both threads between the 1st line (variable = 1) and the 2nd line, but it still didn't work.
Uncommenting both fences doesn't help either.
I tried running with g++ and clang++. Both give the same result.
#include <thread>
#include <unistd.h>
#include <cstdio>
#include <atomic>
using namespace std;

atomic<int> a, b, c, d;

void foo(){
    a.store(1, memory_order_seq_cst);
    // atomic_thread_fence(memory_order_seq_cst);
    c.store(b, memory_order_seq_cst);
}

void bar(){
    b.store(1, memory_order_seq_cst);
    // atomic_thread_fence(memory_order_seq_cst);
    d.store(a, memory_order_seq_cst);
}

int main(){
    thread t[2];
    t[0] = thread(foo); t[1] = thread(bar);
    t[0].join(); t[1].join();
    printf("%d%d\n", c.load(memory_order_seq_cst), d.load(memory_order_seq_cst));
}
bash$ while [ true ]; do ./a.out | grep "10" ; done
10
10
10
10
10 (c=1, d=0) is easily explained: bar happened to run first, and finished before foo read b.
Quirks of the inter-core communication needed to get threads started on different cores mean it's easily possible for this to happen even though thread(foo) was constructed first in the main thread. For example, maybe an interrupt arrived at the core the OS chose for foo, delaying it from actually getting into that code (footnote 1).
Remember that seq_cst only guarantees that some total order exists for all seq_cst operations which is compatible with the sequenced-before order within each thread (and with any other happens-before relationships established by other factors). So the following order of atomic operations is possible, without even breaking out the a.load in bar (footnote 2) separately from the d.store of the resulting int temporary:
b.store(1,memory_order_seq_cst); // bar1. b=1
d.store(a,memory_order_seq_cst); // bar2. a.load reads 0, d=0
a.store(1,memory_order_seq_cst); // foo1
c.store(b,memory_order_seq_cst); // foo2. b.load reads 1, c=1
// final: c=1, d=0
atomic_thread_fence(seq_cst) has no impact anywhere because all your operations are already seq_cst. A fence basically just stops reordering of this thread's operations; it doesn't wait for or sync with fences in other threads.
(Only a load that sees a value stored by another thread can create synchronization. But such a load doesn't wait for the other store; it has no way of knowing there is another store. If you want to keep loading until you see the value you expect, you have to write a spin-wait loop.)
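For example, if bar needed to wait until foo's a = 1 was actually visible before storing d, it would have to spin on the load. A minimal sketch (this is not what the original code does):

void bar() {
    b.store(1, memory_order_seq_cst);
    while (a.load(memory_order_seq_cst) == 0) {
        // spin until foo's a.store(1) becomes visible; a single load never waits
    }
    d.store(a, memory_order_seq_cst); // now guaranteed to store 1
}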
Footnote 1:
Since all your atomic vars are probably in the same cache line, even if execution did reach the top of foo and bar at the same time on two different cores, false-sharing is likely going to let both operations from one thread happen while the other core is still waiting to get exclusive ownership. Although seq_cst stores are slow enough (on x86 at least) that hardware fairness stuff might relinquish exclusive ownership after committing the first store of 1. Anyway, lots of ways for both operations in one thread to happen before the other thread and get 10 or 01. Even possible to get 11 if we get b=1 then a=1 before either load. Using seq_cst does stop the hardware from doing the load early (before the store is globally visible), so it's very possible.
Footnote 2: The lvalue-to-rvalue evaluation of bare a uses the overloaded (int) conversion which is equivalent to a.load(seq_cst). The operations from foo could happen between that load and the d.store that gets a temporary value from it. d.store(a) is not an atomic copy; it's equivalent to int tmp = a; d.store(tmp);. That isn't necessary to explain your observations.
The printf statements are unsynchronized so output of 10 can be just a reordered 01.
01 happens when the functions before the printf run serially.

Why my std::atomic<int> variable isn't thread-safe?

I don't know why my code isn't thread-safe, as it outputs some inconsistent results.
value 48
value 49
value 50
value 54
value 51
value 52
value 53
My understanding of an atomic object is that it prevents its intermediate state from being exposed, so it should solve the problem of one thread reading it while another thread is writing it.
I used to think I could use std::atomic, without a mutex, to solve the multi-threaded counter increment problem, but it doesn't look like that's the case.
I probably misunderstood what an atomic object is. Can someone explain?
void
inc(std::atomic<int>& a)
{
    while (true) {
        a = a + 1;
        printf("value %d\n", a.load());
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
}

int
main()
{
    std::atomic<int> a(0);
    std::thread t1(inc, std::ref(a));
    std::thread t2(inc, std::ref(a));
    std::thread t3(inc, std::ref(a));
    std::thread t4(inc, std::ref(a));
    std::thread t5(inc, std::ref(a));
    std::thread t6(inc, std::ref(a));
    t1.join();
    t2.join();
    t3.join();
    t4.join();
    t5.join();
    t6.join();
    return 0;
}
I used to think I could use std::atomic without a mutex to solve the multi-threading counter increment problem, and it didn't look like the case.
You can, just not the way you have coded it. You have to think about where the atomic accesses occur. Consider this line of code …
a = a + 1;
1. First the value of a is fetched atomically. Let's say the value fetched is 50.
2. We add one to that value, getting 51.
3. Finally we atomically store that value into a using the = operator.
4. a ends up being 51.
5. We atomically load the value of a by calling a.load().
6. We print the value we just loaded by calling printf().
So far so good. But between steps 1 and 3 some other threads may have changed the value of a - for example to the value 54. So, when step 3 stores 51 into a it overwrites the value 54 giving you the output you see.
As @Sopel and @Shawn suggest in the comments, you can atomically increment the value in a using one of the appropriate member functions (like fetch_add) or operator overloads (like operator++ or operator+=). See the std::atomic documentation for details.
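For example, any of these performs the whole read-modify-write as a single atomic operation (a sketch of just the increment):

a.fetch_add(1); // atomic read-modify-write
++a;            // equivalent: also a single atomic RMW
a += 1;         // likewise
// unlike: a = a + 1;  which is a separate atomic load, plain addition, and atomic store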
Update
I added steps 5 and 6 above. Those steps can also lead to results that may not look correct.
Between the store at step 3 and the call to a.load() at step 5, other threads can modify the contents of a. After our thread stores 51 in a at step 3, it may find that a.load() returns some different number at step 5. Thus the thread that set a to the value 51 may not pass the value 51 to printf().
Another source of problems is that nothing coordinates the execution of steps 5. and 6. between two threads. So, for example, imagine two threads X and Y running on a single processor. One possible execution order might be this …
Thread X executes steps 1 through 5 above incrementing a from 50 to 51 and getting the value 51 back from a.load()
Thread Y executes steps 1 through 5 above incrementing a from 51 to 52 and getting the value 52 back from a.load()
Thread Y executes printf() sending 52 to the console
Thread X executes printf() sending 51 to the console
We've now printed 52 on the console, followed by 51.
Finally, there's another problem lurking at step 6. because printf() doesn't make any promises about what happens if two threads call printf() at the same time (at least I don't think it does).
On a multiprocessor system threads X and Y above might call printf() at exactly the same moment (or within a few ticks of exactly the same moment) on two different processors. We can't make any prediction about which printf() output will appear first on the console.
Note The documentation for printf mentions a lock introduced in C++17 "… used to prevent data races when multiple threads read, write, position, or query the position of a stream." In the case of two threads simultaneously contending for that lock we still can't tell which one will win.
Besides the increment of a being done non-atomically, the fetch of the value to display after the increment is non-atomic with respect to the increment. It is possible that one of the other threads increments a after the current thread has incremented it but before the fetch of the value to display. This would possibly result in the same value being shown twice, with the previous value skipped.
Another issue here is that the threads do not necessarily run in the order they were created. The thread that incremented a to 54 could execute its output before the threads that incremented it to 51, 52, and 53, even though all of those increments had already happened. Since the thread that did a later increment displays its output earlier, you end up with the output not being sequential. This is more likely to happen on a system with fewer than six hardware threads available to run on.
Adding a small sleep between the various thread creations (e.g., sleep_for(10)) would make this less likely to occur, but would still not eliminate the possibility. The only sure way to keep the output ordered is to use some sort of exclusion (like a mutex) to ensure only one thread at a time has access to the increment and output code, treating the increment and the output as a single transaction that must complete before another thread tries to do an increment.
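A sketch of that approach, wrapping the increment and the output in one critical section (the mutex name is illustrative; main stays the same as in the question):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex counter_mutex;

void
inc(std::atomic<int>& a)
{
    while (true) {
        {
            std::lock_guard<std::mutex> lock(counter_mutex);
            ++a;                            // single atomic increment
            printf("value %d\n", a.load()); // printed before any other thread can increment
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
}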
The other answers point out the non-atomic increment and various problems. I mostly want to point out some interesting practical details about exactly what we see when running this code on a real system. (x86-64 Arch Linux, gcc9.1 -O3, i7-6700k 4c8t Skylake).
It can be useful to understand why certain bugs or design choices lead to certain behaviours, for troubleshooting / debugging.
Use int tmp = ++a; to capture the fetch_add result in a local variable instead of reloading it from the shared variable. (And as 1202ProgramAlarm says, you might want to treat the whole increment and print as an atomic transaction if you insist on having your counts printed in order as well as being done properly.)
Or you might want to have each thread record the values it saw in a private data structure to be printed later, instead of also serializing the threads with printf during the increments. (In practice, having them all increment the same atomic variable will serialize them anyway, waiting for access to the cache line; ++a will go in order, so you can tell from the modification order which thread went when.)
Fun fact: a.store(1 + a.load(std::memory_order_relaxed), std::memory_order_release) is what you might do for a variable that is only written by one thread but read by multiple threads. You don't need an atomic RMW because no other thread ever modifies it; you just need a thread-safe way to publish updates. (Or better, in a loop, keep a local counter and just .store() it without loading from the shared variable.)
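A sketch of that single-writer pattern (assuming only this thread ever writes counter; the function name is illustrative):

#include <atomic>

std::atomic<int> counter{0};

void writer_loop(int n) {
    int local = 0; // private count; never reloaded from the shared variable
    for (int i = 0; i < n; ++i) {
        ++local;
        counter.store(local, std::memory_order_release); // publish the update
    }
}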
If you used the default a = ... for a sequentially consistent store, you might as well have done an atomic RMW on x86: a seq_cst store compiles to an atomic xchg, or to mov + mfence, which is as expensive (or more).
What's interesting is that despite the massive problems with your code, no counts were lost or stepped on (no duplicate counts); merely the printing was reordered. So in practice the danger wasn't encountered, because of other effects going on.
I tried it on my own machine and did lose some counts. But after removing the sleep, I just got reordering. (I copy-pasted about 1000 lines of the output into a file, and sort -u to uniquify the output didn't change the line count. It did move some late prints around, though; presumably one thread got stalled for a while.) My testing didn't check for the possibility of lost counts, i.e. values skipped because the code doesn't save the value being stored into a and instead reloads it. I'm not sure there's a plausible way for that to happen here without multiple threads reading the same count, which would be detected.
Store + reload, even a seq-cst store which has to flush the store buffer before it can reload, is very fast compared to printf making a write() system call. (The format string includes a newline and I didn't redirect output to a file so stdout is line-buffered and can't just append the string to a buffer.)
(write() system calls on the same file descriptor are serializing in POSIX: write(2) is atomic. Also, printf(3) itself is thread-safe on GNU/Linux, as required by C++17, and probably by POSIX long before that.)
Stdio locking in printf happens to be enough serialization in almost all cases: the thread that just unlocked stdout and left printf can do the atomic increment and then try to take the stdout lock again.
The other threads were all blocked trying to take the lock on stdout. One (other?) thread can wake up and take the lock on stdout, but for its increment to race with the other thread it would have to enter and leave printf and load a the first time before that other thread commits its a = ... seq-cst store.
This does not mean it's actually safe
Just that testing this specific version of the program (at least on x86) doesn't easily reveal the lack of safety. Interrupts or scheduling variations, including competition from other things running on the same machine, certainly could block a thread at just the wrong time.
My desktop has 8 logical cores so there were enough for every thread to get one, not having to get descheduled. (Although normally that would tend to happen on I/O or when waiting on a lock anyway).
With the sleep there, it is not unlikely for multiple threads to wake up at nearly the same time and race with each other in practice on real x86 hardware. It's so long that timer granularity becomes a factor, I think. Or something like that.
Redirecting output to a file
With stdout open on a non-TTY file, it's full-buffered instead of line-buffered, and doesn't always make a system call while holding the stdout lock.
(I got a 17MiB file in /tmp from hitting control-C a fraction of a second after running ./a.out > output.)
This makes it fast enough for threads to actually race with each other in practice, showing the expected bugs of duplicate values. (A thread reads a but loses ownership of the cache line before it stores tmp + 1, resulting in two or more threads doing the same increment. And/or multiple threads reading the same value when they reload a after flushing their store buffer.)
1228589 unique lines (sort -u | wc) out of 1291035 total lines, so ~5% of the output lines were duplicates.
I didn't check if it was usually one value duplicated multiple times or if it was usually only one duplicate. Or how far backward the value ever jumped. If a thread happened to be stalled by an interrupt handler after loading but before storing val+1, it could be quite far. Or if it actually slept or blocked for some reason, it could rewind indefinitely far.

access order of std::atomic bool variable

I have a function that accesses (reads and writes) a std::atomic<bool> variable. I'm trying to understand the order of execution of instructions so as to decide whether an atomic will suffice or whether I'll have to use mutexes here. The function is given below:
// somewhere the member var 'executing' is defined as std::atomic<bool>
int A::something(){
    int result = 0;
    // my intention is only one thread should enter next block
    // others should just return 0
    if(!executing){
        executing = true;
        ...
        // do some really long processing
        ...
        result = processed;
        executing = false;
    }
    return result;
}
I've read this page on cppreference which mentions -
Each instantiation and full specialization of the std::atomic template defines an atomic type. If one thread writes to an atomic object while another thread reads from it, the behavior is well-defined (see memory model for details on data races)
and on the Memory model page the following is mentioned:
When an evaluation of an expression writes to a memory location and another evaluation reads or modifies the same memory location, the expressions are said to conflict. A program that has two conflicting evaluations has a data race unless either
both conflicting evaluations are atomic operations (see std::atomic)
one of the conflicting evaluations happens-before another (see std::memory_order)
If a data race occurs, the behavior of the program is undefined.
and slightly below that it reads:
When a thread reads a value from a memory location, it may see the initial value, the value written in the same thread, or the value written in another thread. See std::memory_order for details on the order in which writes made from threads become visible to other threads.
This is slightly confusing to me. Which of the above three statements actually applies here?
When I perform if(!executing){, is this an atomic instruction? And more importantly, is it guaranteed that only one of two threads will enter that if body, since the first one will set executing to true?
And if something's wrong with the code as written, how should I rewrite it so that it reflects the original intention?
If I understand correctly, you are trying to ensure that only one thread will ever execute a stretch of code at the same time. This is exactly what a mutex does. Since you mentioned that you don't want threads to block if the mutex is not available, you probably want to take a look at the try_lock() method of std::mutex. See the documentation of std::mutex.
Now to why your code does not work as intended: Simplifying a little, std::atomic guarantees that there will be no data races when accessing the variable concurrently. I.e. there is a well defined read-write order.
This doesn't suffice for what you are trying to do. Just imagine the if branch:
if(!executing) {
    executing = true;
Remember, only the read-write operations on executing are atomic. This leaves at least the negation ! and the if itself unsynchronized. With two threads, the execution order could be like this:
Thread 1 reads executing (atomically), value is false
Thread 1 negates the value read from executing, value = true
Thread 1 evaluates the condition and enters the branch
Thread 2 reads executing (atomically), value is false
Thread 1 sets executing to true
Thread 2 negates the value it read (false), so the condition is true
Thread 2 enters the branch...
Now both threads have entered the branch.
I would suggest something along these lines:
std::mutex myMutex;

int A::something(){
    int result = 0;
    // my intention is only one thread should enter next block
    // others should just return 0
    if(myMutex.try_lock()){
        ...
        // do some really long processing
        ...
        result = processed;
        myMutex.unlock();
    }
    return result;
}
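For completeness, the same "only one thread enters" intent can also be expressed with the atomic itself by using a single read-modify-write instead of a separate read and write. This is a sketch, not part of the answer above, and it has the same caveat as try_lock(): other threads do not wait for the long processing to finish, they just return 0.

int A::something(){
    int result = 0;
    if(!executing.exchange(true)){ // atomically: read the old value and store true
        // only the one thread that saw the old value false gets here
        // ... do some really long processing ...
        result = processed;
        executing = false; // let a later call enter again
    }
    return result;
}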

C++ memory model - does this example contain a data race?

I was reading Bjarne Stroustrup's C++11 FAQ and I'm having trouble understanding an example in the memory model section.
He gives the following code snippet:
// start with x==0 and y==0
if (x) y = 1; // thread 1
if (y) x = 1; // thread 2
The FAQ says there is not a data race here. I don't understand. The memory location x is read by thread 1 and written to by thread 2 without any synchronization (and the same goes for y). That's two accesses, one of which is a write. Isn't that the definition of a data race?
Further, it says that "every current C++ compiler (that I know of) gives the one right answer." What is this one right answer? Couldn't the answer vary depending on whether one thread's comparison happens before or after the other thread's write (or if the other thread's write is even visible to the reading thread)?
// start with x==0 and y==0
if (x) y = 1; // thread 1
if (y) x = 1; // thread 2
Since neither x nor y is true, the other won't be set to true either. No matter the order the instructions are executed, the (correct) result is always x remains 0, y remains 0.
The memory location x is ... written to by thread 2
Is it really? Why do you say so?
If y is 0 then x is not written to by thread 2. And y starts out 0. Similarly, x cannot be non-zero unless somehow y is non-zero "before" thread 1 runs, and that cannot happen. The general point here is that conditional writes that don't execute don't cause a data race.
This is a non-trivial fact of the memory model, though, because a compiler that is not aware of threading would be permitted (assuming y is not volatile) to transform the code if (x) y = 1; to int tmp = y; y = 1; if (!x) y = tmp;. Then there would be a data race. I can't imagine why it would want to do that exact transformation, but that doesn't matter, the point is that optimizers for non-threaded environments can do things that would violate the threaded memory model. So when Stroustrup says that every compiler he knows of gives the right answer (right under C++11's threading model, that is), that's a non-trivial statement about the readiness of those compilers for C++11 threading.
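Spelled out as code, the disallowed transformation described above would look like this (illustrative only):

// thread 1 as written:
if (x) y = 1;

// hypothetical transformation by a thread-unaware optimizer (not allowed under C++11):
int tmp = y;     // unconditional read of y
y = 1;           // unconditional, speculative write to y
if (!x) y = tmp; // undo the write if the condition was false
// y is now written even when x == 0, so thread 2's read of y would race with it.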
A more realistic transformation of if (x) y = 1 would be y = x ? 1 : y;. I believe that this would cause a data race in your example, and that there is no special treatment in the standard for the assignment y = y that makes it safe to execute unsequenced with respect to a read of y in another thread. You might find it hard to imagine hardware on which it doesn't work, and anyway I may be wrong, which is why I used a different example above that's less realistic but has a blatant data race.
There has to be a total ordering of the writes, because of the fact that no thread can write to the variable x or y until some other thread has first written a 1 to either variable. In other words you have basically three different scenarios:
Thread 1 gets to write to y because x was written to at some point before the if statement; then, if thread 2 comes later, it writes to x the same value of 1 and doesn't change its previous value of 1.
Thread 2 gets to write to x because y was changed at some point before the if statement; then thread 1, if it comes later, writes the same value of 1 to y.
If there are only two threads, then the if statements are jumped over because x and y remain 0.
Neither of the writes occurs, so there is no race. Both x and y remain zero.
(This is talking about the problem of phantom writes. Suppose one thread speculatively did the write before checking the condition, then attempted to correct things after. That would break the other thread, so it isn't allowed.)
The memory model sets the supported size of the code and data areas: before compiling and linking source code, we need to specify the memory model, i.e. the size limits for data and code. (Note that this is a different, unrelated meaning of "memory model" from the C++ concurrency memory model discussed above.)