C++ atomics memory ordering for some specific use case

I'm in the following situation: I use an atomic<uint64_t> as a counter, increment it from 5 or more threads, and use the value before the increment to make a decision.
atomic<uint64_t> global_counter;
void thread_function() {
    uint64_t local_counter = global_counter.fetch_add(1, std::memory_order_relaxed);
    if (local_counter == 24)
        do_something(local_counter);
}
thread_function() will be executed by 5 different threads. Once I have local_counter, my code no longer cares whether global_counter changes again while thread_function() is running (the business logic only needs a unique, incrementing value per thread_function() call).
Is std::memory_order_relaxed safe to use in this case?

atomic<...>::fetch_add(..., std::memory_order_relaxed) guarantees the atomic execution, but nothing more.
But even with memory_order_relaxed, there will be one, and only one thread calling do_something(). Since this fetch_add is the only operation on global_counter, and it is executed atomically, the value 24 must be reached exactly once. But there is no guarantee which thread it will be.
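The claim above can be checked with a small harness. This is only a sketch: count_hits, its parameters, and the hits counter are hypothetical names introduced for illustration, and the worker calls fetch_add in a loop so that the counter is certain to pass 24.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical harness: returns how many fetch_add calls observed `target`.
int count_hits(int num_threads, int calls_per_thread, std::uint64_t target) {
    assert(num_threads > 0 && calls_per_thread > 0);
    std::atomic<std::uint64_t> global_counter{0};
    std::atomic<int> hits{0};

    auto worker = [&] {
        for (int n = 0; n < calls_per_thread; ++n) {
            // Relaxed is enough here: atomicity alone makes each returned value unique.
            std::uint64_t local_counter =
                global_counter.fetch_add(1, std::memory_order_relaxed);
            if (local_counter == target)
                hits.fetch_add(1, std::memory_order_relaxed);
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back(worker);
    for (auto& th : threads)
        th.join();  // join() synchronizes, so reading hits afterwards is safe

    return hits.load();
}
```

With 5 threads making 10 calls each, the values 0 through 49 are each returned exactly once, so count_hits(5, 10, 24) is 1 no matter how the threads interleave.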

Related

C++: std::memory_order in std::atomic_flag::test_and_set to do some work only once by a set of threads

Could you please help me understand which std::memory_order should be used in std::atomic_flag::test_and_set to do some work only once across a set of threads, and why? The work should be done by whichever thread gets to it first, and all other threads should just check as quickly as possible that someone is already doing the work and continue working on other tasks.
In my tests of the example below, any memory order works, but I think that is just a coincidence. I suspect that Release-Acquire ordering is what I need, but in my case the same memory_order has to be used in both threads (one thread cannot use memory_order_release and the other memory_order_acquire, since I do not know which thread will get to the work first).
#include <atomic>
#include <iostream>
#include <thread>
std::atomic_flag done = ATOMIC_FLAG_INIT;
const std::memory_order order = std::memory_order_seq_cst;
//const std::memory_order order = std::memory_order_acquire;
//const std::memory_order order = std::memory_order_relaxed;
void do_some_work_that_needs_to_be_done_only_once(void)
{
    std::cout << "Hello, my friend\n";
}
void run(void)
{
    if (not done.test_and_set(order))
        do_some_work_that_needs_to_be_done_only_once();
}
int main(void)
{
    std::thread a(run);
    std::thread b(run);
    a.join();
    b.join();
    // expected result:
    // * only one thread said hello
    // * all threads spent as little time as possible to check if any
    //   other thread said hello yet
    return 0;
}
Thank you very much for your help!

Following up on some things in the comments:
As has been discussed, there is a well-defined modification order M for done on any given run of the program. Every thread does one store to done, which means one entry in M. And by the nature of atomic read-modify-writes, the value returned by each thread's test_and_set is the value that immediately precedes its own store in the order M. That's promised in C++20 atomics.order p10, which is the critical clause for understanding atomic RMW in the C++ memory model.
Now there are a finite number of threads, each corresponding to one entry in M, which is a total order. Necessarily there is one such entry that precedes all the others. Call it m1. The test_and_set whose store is entry m1 in M must return the preceding value in M. That can only be the value 0 which initialized done. So the thread corresponding to m1 will see test_and_set return 0. Every other thread will see it return 1, because each of their modifications m2, ..., mN follows (in M) another modification, which must have been a test_and_set storing the value 1.
We may not be bothering to observe all of the total order M, but this program does determine which of its entries is first on this particular run. It's the unique one whose test_and_set returns 0. A thread that sees its test_and_set return 1 won't know whether it came 2nd or 8th or 96th in that order, but it does know that it wasn't first, and that's all that matters here.
Another way to think about it: suppose it were possible for two threads (tA, tB) both to load the value 0. Well, each one makes an entry in the modification order; call them mA and mB. M is a total order so one has to go before the other. And bearing in mind the all-important [atomics.order p10], you will quickly find there is no legal way for you to fill out the rest of M.
All of this is promised by the standard without any reference to memory ordering, so it works even with std::memory_order_relaxed. The only effect of relaxed memory ordering is that we can't say much about how our load/store will become visible with respect to operations on other variables. That's irrelevant to the program at hand; it doesn't even have any other variables.
In the actual implementation, this means that an atomic RMW really has to exclusively own the variable for the duration of the operation. We must ensure that no other thread does a store to that variable, nor the load half of a read-modify-write, during that period. In a MESI-like coherent cache, this is done by temporarily locking the cache line in the E state; if the system makes it possible for us to lose that lock (like an LL/SC architecture), abort and start again.
As to your comment about "a thread reading false from its own cache/buffer": the implementation mustn't allow that in an atomic RMW, not even with relaxed ordering. When you do an atomic RMW, you must read it while you hold the lock, and use that value in the RMW operation. You can't use some old value that happens to be in a buffer somewhere. Likewise, you have to complete the write while you still hold the lock; you can't stash it in a buffer and let it complete later.
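The modification-order argument can be exercised with more than two threads. The following is a minimal sketch, not the asker's program: count_winners and the winners counter are hypothetical names, and the counter stands in for the work that must run once.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical harness: returns how many threads saw test_and_set return false.
int count_winners(int num_threads) {
    assert(num_threads > 0);
    std::atomic_flag done = ATOMIC_FLAG_INIT;
    std::atomic<int> winners{0};

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&] {
            // Only the RMW that is first in the modification order of `done`
            // can read the initial (clear) value, even with relaxed ordering.
            if (!done.test_and_set(std::memory_order_relaxed))
                winners.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : threads)
        th.join();

    return winners.load();
}
```

Whatever the interleaving, count_winners(8) should be 1, because exactly one entry is first in the modification order.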

relaxed is fine if you just need to determine the winner of the race to set the flag (see footnote 1), so one thread can start on the work and later threads can just continue on.
If the run_once work produces data that other threads need to be able to read, you'll need a release store after that, to let potential readers know that the work is finished, not just started. If it was instead just something like printing or writing to a file, and other threads don't care when that finishes, then yes, you have no ordering requirements between threads beyond the modification order of done, which exists even with relaxed. An atomic RMW like test_and_set lets you determine which thread's modification was first.
BTW, you should check read-only before even trying to test-and-set; unless run() is only called very infrequently, like once per thread startup. For something like a static int foo = non_constant; local var, compilers use a guard variable that's loaded (with an acquire load) to see if init is already complete. If it's not, branch to code that uses an atomic RMW to modify the guard variable, with one thread winning, the rest effectively waiting on a mutex for that thread to init.
You might want something like that if you have data that all threads should read. Or just use a static int foo = something_to_run_once(), or some type other than int, if you actually have some data to init.
Or perhaps use C++11 std::call_once to solve this problem for you.
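A minimal sketch of the std::call_once option follows; run_once_demo and the calls counter are hypothetical names for illustration. Note one behavioural difference from the test_and_set approach: threads that lose the race block until the winning call has finished, rather than continuing immediately.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical harness: returns how many times the one-time work actually ran.
int run_once_demo(int num_threads) {
    assert(num_threads > 0);
    std::once_flag once;
    int calls = 0;  // only ever written inside the call_once callable

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&] {
            // Exactly one thread executes the callable; the others wait for it
            // to complete and then return without running it.
            std::call_once(once, [&] { ++calls; });
        });
    }
    for (auto& th : threads)
        th.join();

    return calls;
}
```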
On normal systems, atomic_flag has no advantage over atomic_bool. done.exchange(true) on a bool is equivalent to test_and_set of a flag. But atomic_bool is more flexible in terms of the operations it supports, like a plain read that isn't part of an RMW test-and-set.
C++20 does add a test() method for atomic_flag. ISO C++ guarantees that atomic_flag is lock-free, but in practice so is std::atomic<bool> on all real-world systems.
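The "check read-only first" advice combined with std::atomic<bool> might look like the sketch below. try_claim and count_claims are hypothetical helpers, not part of any library, and relaxed ordering is used on the assumption that only the winner of the race matters.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: returns true only for the single caller that wins the race.
bool try_claim(std::atomic<bool>& done) {
    // Cheap read-only check: once the flag is set, callers never attempt the RMW,
    // so they do not need exclusive ownership of the cache line.
    if (done.load(std::memory_order_relaxed))
        return false;
    // exchange(true) is the atomic<bool> equivalent of test_and_set.
    return !done.exchange(true, std::memory_order_relaxed);
}

// Hypothetical harness: counts how many threads successfully claimed the flag.
int count_claims(int num_threads) {
    assert(num_threads > 0);
    std::atomic<bool> done{false};
    std::atomic<int> claims{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back([&] {
            if (try_claim(done))
                claims.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : threads)
        th.join();
    return claims.load();
}
```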
Footnote 1: why relaxed guarantees a single winner
The memory_order parameter only governs ordering wrt. operations on other variables by the same thread.
Does calling test_and_set by a thread force somehow synchronization of the flag with values written by other threads?
It's not a pure write, it's an atomic read-modify-write, so the result of the one that went first is guaranteed to be visible to the one that happens to be second. That's the whole point of test-and-set as a primitive building block for mutual exclusion.
If two TAS operations could both load the original value (false), and then both store true, they wouldn't be atomic. They'd have overlapped with each other.
Two atomic RMWs on the same atomic object must happen in some order, the modification-order of that object. (Because they're not read-only: an RMW includes a modification. But also includes a read so you can see what the value was immediately before the new value; that read is tied to the modification order, unlike a plain read).
Every atomic object separately has a modification-order that all threads can agree on; this is guaranteed by ISO C++. (With orders less than seq_cst, ordering between objects can be different from source order, and not guaranteed that all threads even agree which store happened first, the IRIW problem.)
Being an atomic RMW guarantees that exactly one test_and_set will return false in thread A or B. Same for fetch_add with multiple threads incrementing a counter: the increments have to happen in some order (i.e. serialized with each other), and whatever that order is becomes the modification-order of that atomic object.
Atomic RMWs have to work this way to not lose counts. i.e. to actually be atomic.

std::atomic - behaviour of relaxed ordering

Can the following call to print result in outputting stale/unintended values?
std::mutex g;
std::atomic<int> seq;
int g_s = 0;
int i = 0, j = 0, k = 0; // ignore fact that these could easily made atomic
// Thread 1
void do_work() // seldom called
{
// avoid over
std::lock_guard<std::mutex> lock{g};
i++;
j++;
k++;
seq.fetch_add(1, std::memory_order_relaxed);
}
// Thread 2
void consume_work() // spinning
{
const auto s = g_s;
// avoid overhead of constantly acquiring lock
g_s = seq.load(std::memory_order_relaxed);
if (s != g_s)
{
// no lock guard
print(i, j, k);
}
}

TL;DR: this is super broken; use a Seq Lock instead. Or RCU if your data structure is bigger.
Yes, you have data-race UB, and in practice stale values are likely; so are inconsistent values (from different increments). ISO C++ has nothing to say about what will happen, so it depends on how it happens to compile for some real machine, and interrupts / context switches in the reader that happen in the middle of reading some of these multiple vars. e.g. if the reader sleeps for any reason between reading i and j, you could miss many updates, or at least get a j that doesn't match your i.
Relaxed seq with writer+reader using lock_guard
I'm assuming the writer would look the same, so the atomic RMW increment is inside the critical section.
I'm picturing the reader checking seq like it is now, and only taking a lock after that, inside the block that runs print.
Even if you did use lock_guard to make sure the reader got a consistent snapshot of all three variables (something you couldn't get from making each of them separately atomic), I'm not sure relaxed would be sufficient in theory. It might be in practice on most real implementations for real machines (where compilers have to assume there might be a reader that synchronizes a certain way, even if there isn't in practice). I'd use at least release/acquire for seq, if I was going to take a lock in the reader.
Taking a mutex is an acquire operation, same as a std::memory_order_acquire load on the mutex object. A relaxed increment inside a critical section can't become visible to other threads until after the writer has taken the lock.
But in the reader, with if( xyz != seq.load(relaxed) ) { take_lock; ... }, the load is not guaranteed to "happen before" taking the lock. In practice on many ISAs it will, especially x86 where all atomic RMWs are full memory barriers. But in ISO C++, and maybe some real implementations, it's possible for the relaxed load to reorder into the reader's critical section. Of course, ISO C++ doesn't define things in terms of "reordering", only in terms of synchronizes-with relationships and which values a load is allowed to see.
(This reordering may not be fully plausible; it would mean the read side would have to actually take the lock based on branch prediction / speculation on the load result. Maybe with lock elision like x86 did with transactional memory, except without x86's strong memory ordering?)
Anyway, it's pretty hairy to reason about, and release / acquire ops are quite cheap on most CPUs. If you expected it to be expensive, and for the check to often be false, you could check again with an acquire load, or put an acquire fence inside the if so it doesn't happen on the no-new-work path.
Use a Seq Lock
Your problem is better solved by using your sequence counter as part of a Seq Lock, so neither reader nor writer needs a mutex. (Summary: increment before writing, then touch the payload, then increment again. In the reader, read i, j, and k into local temporaries, then check the sequence number again to make sure it's the same, and an even number. With appropriate memory barriers.
See the wikipedia article and/or link below for actual details, but the real change from what you have now is that the sequence number has to increment by 2. If you can't handle that, use a separate counter for the actual lock, with seq as part of the payload.)
If you don't want to use a mutex in the reader, using one in the writer only helps in terms of implementation-detail side-effects, like making sure stores to memory actually happen, not keeping i in a register across calls if do_work inlines into some caller.
BTW, updating seq doesn't need to be an atomic RMW if there's only one writer. You can relaxed load and separately store an incremented temporary (with release semantics).
A Seq Lock is good for cheap reads and occasional writes that make the reader retry. Implementing 64 bit atomic counter with 32 bit atomics shows appropriate fencing.
It relies on non-atomic reads that may have a data race, but not using the result if your sequence counter detects tearing. C++ doesn't define the behaviour in that case, but it works in practice on real implementations. (C++ is mostly keeping its options open in case of hardware race detection, which normal CPUs don't do.)
If you have multiple writers, you'd still use a normal lock to give mutual exclusion between them, or use the sequence counter as a spinlock, where a writer acquires it by making the count odd. Otherwise you just need the sequence counter.
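A single-writer Seq Lock for the three integers could be sketched as follows. This is an illustration under stated assumptions, not the linked implementation: Snapshot, SeqLock, and count_torn_reads are hypothetical names, there is exactly one writer thread, and the payload is stored in relaxed std::atomic<int> members (rather than plain non-atomic variables) so that the sketch stays within behaviour that ISO C++ defines.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

struct Snapshot { int i, j, k; };

class SeqLock {
public:
    // Assumes a single writer, so the counter needs no atomic RMW.
    void write(const Snapshot& v) {
        const std::uint32_t s = seq_.load(std::memory_order_relaxed);
        seq_.store(s + 1, std::memory_order_relaxed);         // odd: update in progress
        std::atomic_thread_fence(std::memory_order_release);  // odd store ordered before payload stores
        i_.store(v.i, std::memory_order_relaxed);
        j_.store(v.j, std::memory_order_relaxed);
        k_.store(v.k, std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);         // even: update complete
    }

    Snapshot read() const {
        for (;;) {
            const std::uint32_t before = seq_.load(std::memory_order_acquire);
            if (before & 1u)
                continue;  // writer is mid-update, retry
            Snapshot v{i_.load(std::memory_order_relaxed),
                       j_.load(std::memory_order_relaxed),
                       k_.load(std::memory_order_relaxed)};
            std::atomic_thread_fence(std::memory_order_acquire);  // payload loads ordered before re-check
            const std::uint32_t after = seq_.load(std::memory_order_relaxed);
            if (before == after)
                return v;  // no write overlapped the reads
        }
    }

private:
    std::atomic<std::uint32_t> seq_{0};
    std::atomic<int> i_{0}, j_{0}, k_{0};
};

// Hypothetical stress test: returns the number of inconsistent snapshots observed.
int count_torn_reads(int writes) {
    assert(writes > 0);
    SeqLock lock;
    std::thread writer([&] {
        for (int n = 1; n <= writes; ++n)
            lock.write({n, n, n});
    });
    int torn = 0;
    Snapshot v{0, 0, 0};
    while (v.i < writes) {  // stop once the final write has been observed
        v = lock.read();
        if (v.i != v.j || v.j != v.k)
            ++torn;
    }
    writer.join();
    return torn;
}
```

The reader never blocks the writer; it simply retries when the sequence number is odd or changed during its reads, which is the trade-off that suits occasional writes and frequent cheap reads.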
Your global g_s is just to track the latest sequence number the reader has seen? Storing it next to the data defeats some of the purpose/benefit, since it means the reader is writing the same cache line as the writer, assuming that variables declared near each other all end up together. Consider making it static inside the function, or separate it with other stuff, or with padding, like alignas(64) or 128. (That wouldn't guarantee that a compiler doesn't put it right before the other vars, though; a struct would let you control the layout of all of them. With enough alignment, you can make sure they're not in the same aligned pair of cache lines.)

Even ignoring the staleness, this causes a data race and UB.
Thread 2 can read i, j, k while thread 1 is modifying them, because you don't synchronize the access to those variables. If thread 2 doesn't respect the mutex g, there's no point in locking it in thread 1.

Yes, it can.
First of all, the lock guard does not have any effect on your code. A lock has to be used by at least two threads to have any effect.
Thread 2 can read at any moment. It can read an incremented i together with a j and k that have not been incremented yet. In theory, it can even read a weird partial value by reading in between the updates to the individual bytes that make up i (for example, an increment from 0xFF to 0x100 could result in reading 0x1FF or 0x0), though not on x86, where these updates happen to be atomic.

How do we force variable sharing?

Consider the following code:
std::atomic<bool> flag(false);
//Thread 1
flag.store(true,std::memory_order_relaxed);
//Thread 2
while(!flag.load(std::memory_order_relaxed)) ; // stay in the loop
std::cout << "loaded";
Is there any guarantee that the last line ever gets executed?
If the answer is no, how should it be fixed (with as little overhead as possible)?

Yes, the last line is guaranteed to be executed eventually, per [intro.progress]/18:
An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.
Since your flag is atomic and is also the only thing any thread ever accesses in any way, there are no data races here. Since there are no loads or stores to objects other than your atomic to begin with, your program cannot possibly depend on any particular ordering of such non-existent loads or stores relative to the loads and stores to your atomic. Thus, relaxed memory order is perfectly sufficient. Since it is guaranteed that the atomic store in thread 1 will eventually become visible to thread 2, the loop is guaranteed to eventually terminate…
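As a runnable illustration of the snippet in the question, here is a minimal sketch; wait_for_flag and the loaded variable are hypothetical names, and the std::cout call is replaced by a boolean so the result can be checked.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical harness: returns true once the waiting thread has left its loop.
bool wait_for_flag() {
    std::atomic<bool> flag{false};
    bool loaded = false;  // written by the waiter, read only after join()

    std::thread waiter([&] {
        while (!flag.load(std::memory_order_relaxed)) {
            // stay in the loop; the load is atomic, so it is repeated on every iteration
        }
        loaded = true;
    });

    flag.store(true, std::memory_order_relaxed);  // "Thread 1"
    waiter.join();

    assert(flag.load());  // the store is certainly visible after join()
    return loaded;
}
```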

Is x++; threadsafe?

If I update a variable in one thread like this:
receiveCounter++;
and then from another thread I only ever read this variable and write its value to a GUI.
Is that safe? Or could this instruction be interrupted in the middle, so that the value in receiveCounter is wrong when another thread reads it? I assume it could, since ++ is not atomic; it is several instructions.
I don't care about synchronizing reads and writes. The counter just needs to be incremented and then updated in the GUI, and the two do not have to happen directly after each other.
What I care about is that the value cannot be wrong, for example the ++ operation being interrupted in the middle so that the read value is completely off.
Do I need to lock this variable? I really do not want to, since it is updated very often. I guess I could solve this by posting a message to a MAIN thread and copying the value to a queue (which would then need to be locked, but I would not do this on every update).
But I am interested in the above problem anyway.

If one thread changes the value in a variable and another thread reads that value and the program does not synchronize the accesses, it has a data race, and the behavior of the program is undefined. Change the type of receiveCounter to std::atomic<int> (assuming it's an int to begin with).
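A minimal sketch of that change, assuming one receiving thread and one GUI thread: receive_and_watch, finished, and the monotonic check are hypothetical names added for illustration, and the GUI update is reduced to reading the counter.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical harness: one thread increments, another only reads.
// Returns the final counter value.
int receive_and_watch(int messages) {
    assert(messages >= 0);
    std::atomic<int> receiveCounter{0};
    std::atomic<bool> finished{false};
    bool monotonic = true;  // written by the reader, checked after join()

    std::thread gui([&] {
        int last_seen = 0;
        while (!finished.load(std::memory_order_acquire)) {
            // An atomic load never observes a half-written value.
            int now = receiveCounter.load(std::memory_order_relaxed);
            if (now < last_seen)
                monotonic = false;
            last_seen = now;
        }
    });

    for (int n = 0; n < messages; ++n)
        receiveCounter.fetch_add(1, std::memory_order_relaxed);  // atomic replacement for receiveCounter++

    finished.store(true, std::memory_order_release);
    gui.join();

    assert(monotonic);  // values seen by the reader never go backwards
    return receiveCounter.load();
}
```

Writing ++receiveCounter on a std::atomic<int> is also an atomic increment; it uses the default sequentially consistent ordering rather than relaxed.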

At its core it is a read-modify-write operation, and those are not atomic. Some processors have a dedicated instruction for it; the very common Intel/AMD cores, for example, have an INC instruction.
While that sounds like it could be atomic, since it is a single instruction, it still isn't. The x86/x64 instruction set no longer has much to do with the way the execution engine is actually implemented: the engine executes RISC-like "micro-ops", and the INC instruction is translated to multiple micro-ops. It can be made atomic with the LOCK prefix on the instruction, but compilers don't emit it unless they know that an atomic update is desired.
You will thus need to be explicit about it. The C++11 std::atomic<> standard addition is a good way. Your operating system or compiler will have intrinsics for it, usually named something like "Interlocked" or "__built_in".

Simple answer: NO.
Because i++; is the same as i = i + 1;, it consists of a load, an arithmetic operation, and a store of the result, so in general it is not atomic.
However, the operations that are actually executed depend on the instruction set of the CPU and might be atomic on some architectures. But the default is still not atomic.

Generally speaking, it is not thread-safe because the ++ operator consists in one read and one write, the pair of which is not atomic and can be interrupted in between.
Then, it probably also depends on the language/compiler/architecture, but in a typical case the increment operation is probably not thread safe.
As for the read and write operations themselves, as long as you are not in 64-bit mode, they should be atomic, so if you don't care about other threads having the value wrong by 1, it may be ok for your case.

No. Increment operation is not atomic, hence not thread-safe.
In your case it is safe to use this operation if you don't care about its value in specific time (if you only read this variable from another thread and not trying to write to it). It will eventually increment receiveCounter's value, you just don't have any guarantees about operations order.

++ is equivalent to i=i+1.
++ is not an atomic operation, so it's NOT thread safe. Only reads and writes of primitive variables (except long and double) are atomic and hence thread safe.

thread synchronization - delicate issue

Let's say I have this loop:
static int a;
for (static int i = 0; i < 10; i++)
{
    a++;
    ///// point A
}
Two threads enter this loop.
I'm not sure about something. What happens if thread 1 gets to point A and stays there while thread 2 goes through the loop 10 times, and then, after thread 2 has incremented i to 10 on the 10th iteration but before it checks whether i is less than 10, thread 1 leaves the loop body, increments i, and is supposed to enter the loop again?
Which value of i will thread 1 see and increment? Will it be 10 or 0?
Is it possible that thread 1 will increment i to 1, and then thread 2 will go through the loop again 9 times (and then maybe 8, 7, etc.)?
Thanks

You have to realize that an increment operation is effectively really:
read the value
add 1
write the value back
You have to ask yourself, what happens if two of these happen in two independent threads at the same time:
static int a = 0;
thread 1 reads a (0)
adds 1 (value is 1)
thread 2 reads a (0)
adds 1 (value is 1)
thread 1 writes (1)
thread 2 writes (1)
For two simultaneous increments, you can see that it is possible that one of them gets lost because both threads read the pre-incremented value.
The example you gave is complicated by the static loop index, which I didn't notice at first.
Since this is c++ code, standard implementation is that the static variables are visible to all threads, thus there is only one loop counting variable for all threads. The sane thing to do would be to use a normal auto variable, because each thread would have its own, no locking required.
That means that while you will lose increments sometimes, you also may gain them because the loop itself may lose count and iterate extra times. All in all, a great example of what not to do.

If i is shared between multiple threads, all bets are off. It's possible for any thread to increment i at essentially any point during another thread's execution (including halfway through that thread's increment operation). There is no meaningful way to reason about the contents of i in the above code. Don't do that. Either give each thread its own copy of i, or make the increment and comparison with 10 a single atomic operation.
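One way to make "increment and compare with 10" a single atomic step is to branch on the value returned by fetch_add. This is a sketch under assumptions: run_shared_loop is a hypothetical name, both i and a are turned into std::atomic<int>, and i is allowed to overshoot 10 after the loop ends.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical harness: all threads share one loop counter; returns the final a.
int run_shared_loop(int num_threads) {
    assert(num_threads > 0);
    std::atomic<int> i{0};
    std::atomic<int> a{0};

    auto worker = [&] {
        // fetch_add returns the previous value, so each index 0..9 is handed
        // to exactly one thread; the comparison uses that private copy.
        while (i.fetch_add(1, std::memory_order_relaxed) < 10) {
            a.fetch_add(1, std::memory_order_relaxed);
            // point A
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back(worker);
    for (auto& th : threads)
        th.join();

    return a.load();
}
```

The loop body runs exactly 10 times in total regardless of how the threads interleave, although which thread runs which iteration is still unpredictable.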

It's not really a delicate issue because you would never allow this in real code if the synchronization was going to be an issue.

I'm just going to use i++ in your loop:
for (static int i=0; i<10; i++)
{
}
Because it mimics a. (Note, static here is very strange)
Consider if Thread A is suspended just as it reaches i++. Thread B gets i all the way to 9, goes into i++ and makes it 10. If it got to move on, the loop would exit. Ah, but now Thread A is resumed! So it continues where it left off: increment i! So i becomes 11, and your loop is borked.
Any time threads share data, it needs to be protected. You could also make i++ and i < 10 happen atomically (never be interrupted), if your platform supports it.

You should use mutual exclusion to solve this problem.

And that is why, in a multi-threaded environment, we are supposed to use locks.
In your case, you should write:
bool test_increment(int& i)
{
    lock();
    ++i;
    bool result = i < 10;
    unlock();
    return result;
}
static int a;
for (static int i = -1; test_increment(i); )
{
    ++a;
    // Point A
}
Now the problem disappears. Note that lock() and unlock() are supposed to lock and unlock a mutex common to all threads trying to access i!
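With the C++11 standard library, the lock()/unlock() placeholders could be filled in with std::mutex. The sketch below is one possible reading, not the answerer's code: run_locked_loop is a hypothetical name, and it additionally takes the lock around ++a, since a is shared between the threads as well.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical harness: returns the final value of a after all threads finish.
int run_locked_loop(int num_threads) {
    assert(num_threads > 0);
    std::mutex m;  // the mutex common to all threads
    int i = -1;
    int a = 0;

    auto test_increment = [&] {
        std::lock_guard<std::mutex> guard(m);  // unlocks automatically on return
        ++i;
        return i < 10;
    };

    auto worker = [&] {
        while (test_increment()) {
            std::lock_guard<std::mutex> guard(m);  // a is shared too, so protect it as well
            ++a;
            // Point A
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back(worker);
    for (auto& th : threads)
        th.join();

    return a;
}
```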

Yes, it's possible that either thread can do the majority of the work in that loop. But as Dynite explained, this would (and should) never show up in real code. If synchronization is an issue, you should provide mutual exclusion (a Boost, pthread, or Windows threads mutex) to prevent race conditions such as this.

Why would you use a static loop counter?
This smells like homework, and a bad one at that.

Both the threads have their own copy of i, so the behavior can be anything at all. That's part of why it's such a problem.
When you use a mutex or critical section the threads will generally sync up, but even that is not absolutely guaranteed if the variable is not volatile.
And someone will no doubt point out "volatile has no use in multithreading!" but people say lots of stupid things. You don't have to have volatile but it is helpful for some things.

If your "int" is not the atomic machine word size (think 64 bit address + data emulating a 32-bit VM) you will "word-tear". In that case your "int" is 32 bits, but the machine addresses 64 atomically. Now you have to read all 64, increment half, and write them all back.
This is a much larger issue; bone up on processor instruction sets, and grep gcc for how it implements "volatile" everywhere if you really want the gory details.
Add "volatile" and see how the machine code changes. If you aren't looking down at the chip registers, please just use boost libraries and be done with it.

If you need to increment a value with multiple threads at the same time, then look up "atomic operations". For Linux, look up "gcc atomic operations". There is hardware support on most platforms to atomically increment, add, compare and swap, and more. LOCKING WOULD BE OVERKILL for this; an atomic inc is magnitudes faster than lock, inc, unlock. If you have to change a lot of fields at the same time you may need a lock, although you can change 128 bits worth of fields at a time with most atomic ops.
volatile is not the same as an atomic operation. Volatile helps the compiler know when it's a bad idea to use a copy of a variable. Among its uses, volatile is important when you have multiple threads changing data that you would like to read the "most up to date version" of without locking. Volatile will still not fix your a++ problem, as two threads can read the value of "a" at the same time and then both increment the same "a"; the last one to write "a" wins and you have lost an inc. Volatile will slow down optimized code by not letting the compiler hold values in registers and whatnot.
