Acquire-release memory order between multiple threads - c++

Code:
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> done{0};
int a = 10;

void f1() {
    a = 20;
    done.store(1, std::memory_order_release);
}

void f2() {
    if (done.load(std::memory_order_acquire) == 1) {
        assert(a == 20);
    }
}

void f3() {
    if (done.load(std::memory_order_acquire) == 1) {
        assert(a == 20);
    }
}

int main() {
    std::thread t1(f1);
    std::thread t2(f2);
    std::thread t3(f3);
    t1.join();
    t2.join();
    t3.join();
}
The question is: if threads 2 and 3 both see done == 1, will
the assertion a == 20 hold in both threads?
I know acquire-release works for a pair of threads, but does
it work across multiple threads as well?

Yes. The release-acquire relationship holds separately for all pairs of threads (that access the same atomic location!) and guarantees that all writes sequenced before the release are visible to all reads sequenced after the corresponding acquire.

A release is like publishing a newspaper (one with no defined periodicity), and an acquire is like buying the latest edition right now and discovering what it says (not caring which edition it is, or which day it is). (You rarely need to do versioning on these shared atomics, although it can sometimes be needed.)
Any number of people can buy the newspaper. What matters is that what's printed was true when it was printed, and is still true if it records invariant facts, like the construction of a monument (unchangeable by hypothesis).
So a release operation publishes what is true at the time of publication, and proper design guarantees that these facts cannot have changed by the time you can "buy" (acquire) the publication. Any number of threads can see these facts.
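To make the newspaper analogy concrete, here is a minimal sketch of the publish pattern with several readers (the names headline and published are illustrative, not from the question): one writer fills in plain data and then release-stores a flag, and any number of readers that acquire-load the flag and see it set are guaranteed to also see the data.
#include <atomic>
#include <cassert>
#include <thread>

int headline = 0;                   // plain, non-atomic data
std::atomic<bool> published{false};

void publisher()
{
    headline = 42;                                    // write the facts
    published.store(true, std::memory_order_release); // publish
}

void reader()
{
    if (published.load(std::memory_order_acquire)) // buy the paper
        assert(headline == 42);                    // the facts are visible
}

int main()
{
    std::thread w(publisher), r1(reader), r2(reader), r3(reader);
    w.join(); r1.join(); r2.join(); r3.join();
}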

Related

Why does only "std::memory_order_seq_cst" guarantee the result [duplicate]

#include <thread>
#include <atomic>
#include <cassert>

std::atomic<bool> x = {false};
std::atomic<bool> y = {false};
std::atomic<int> z = {0};

void write_x()
{
    x.store(true, std::memory_order_release);
}

void write_y()
{
    y.store(true, std::memory_order_release);
}

void read_x_then_y()
{
    while (!x.load(std::memory_order_acquire))
        ;
    if (y.load(std::memory_order_acquire)) {
        ++z;
    }
}

void read_y_then_x()
{
    while (!y.load(std::memory_order_acquire))
        ;
    if (x.load(std::memory_order_acquire)) {
        ++z;
    }
}

int main()
{
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join(); b.join(); c.join(); d.join();
    assert(z.load() != 0);
}
If I replace seq_cst with acquire/release in cppreference's last example,
can assert(z.load() != 0) fail?
seq_cst can prevent StoreLoad reordering, but this code doesn't have that pattern.
Acquire can prevent LoadLoad reordering.
Release can prevent StoreStore reordering.
Yes, it is possible that z.load() == 0 in your code if you use acquire/release order as you've done. There is no happens-before relationship between the independent writes to x and y. It's not a coincidence cppreference used that example specifically to illustrate a case where acquire/release isn't sufficient.
This is sometimes called IRIW (independent reads of independent writes), and it tended to be glossed over in some hardware ordering models. In particular, a memory model defined only in terms of the possible load-load, load-store, store-store, etc., reorderings doesn't really say anything either way about IRIW. In the x86 memory model, IRIW reordering is disallowed because of a clause specifying that stores have a total order and that all processors observe stores in this same order.
I'm not aware of any CPUs in common use that admit IRIW reordering when the barriers and/or instructions needed for acquire and release are used, but I wouldn't be surprised if some do.
Yes, the assert can fire.
The principal property that is not guaranteed by acquire/release is a single total order of modifications. It only guarantees that the (non-existent) previous actions of a and b are observed by c and d if they see true from the loads.
A (slightly contrived) example of this is a multi-CPU (physical socket) system that isn't fully cache-coherent. Die 1 has core A running thread a and core C running thread c. Die 2 has core B running thread b and core D running thread d. The interconnect between the two sockets has a long latency compared to a memory operation that hits on-die cache.
a and b run at the same wall-clock time. C is on-die with A, so it can see the store to x immediately, but the interconnect delays its observation of the store to y, so it sees the old value. Similarly, D is on-die with B, so it sees the store to y but misses the store to x.
Whereas if you have sequential consistency, some coordination is required to enforce a total order, such as "C and D are blocked while the interconnect syncs the caches".

C++ memory_order_acquire/release questions

I recently learned about C++'s six memory orders, and I'm very confused about memory_order_acquire and memory_order_release. Here is an example from cppreference:
#include <thread>
#include <atomic>
#include <cassert>

std::atomic<bool> x = {false};
std::atomic<bool> y = {false};
std::atomic<int> z = {0};

void write_x() { x.store(true, std::memory_order_seq_cst); }
void write_y() { y.store(true, std::memory_order_seq_cst); }

void read_x_then_y() {
    while (!x.load(std::memory_order_seq_cst))
        ;
    if (y.load(std::memory_order_seq_cst))
        ++z;
}

void read_y_then_x() {
    while (!y.load(std::memory_order_seq_cst))
        ;
    if (x.load(std::memory_order_seq_cst))
        ++z;
}

int main() {
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join(); b.join(); c.join(); d.join();
    assert(z.load() != 0); // will never happen
}
In the cpp reference page, it says:
This example demonstrates a situation where sequential ordering is necessary.
Any other ordering may trigger the assert because it would be possible
for the threads c and d to observe changes to the atomics x and y in
opposite order.
So my question is why memory_order_acquire and memory_order_release can not be used here? And what semantics does memory_order_acquire and memory_order_release provide?
some references:
https://en.cppreference.com/w/cpp/atomic/memory_order
https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync
Sequential consistency provides a single total order of all sequentially consistent operations. So if you have a sequentially consistent store in thread A, and a sequentially consistent load in thread B, and the store is ordered before the load (in said single total order), then B observes the value stored by A. So basically sequential consistency guarantees that the store is "immediately visible" to other threads. A release store does not provide this guarantee.
As Peter Cordes pointed out correctly, the term "immediately visible" is rather imprecise. The "visibility" stems from the fact that all seq-cst operations are totally ordered, and all threads observe that order. Since the store and the load are totally ordered, the value of a store becomes visible before a subsequent load (in the single total order) is executed.
There exists no such total order between acquire/release operations in different threads, so there is no visibility guarantee. The operations are only ordered once an acquire-operation observes the value from a release-operation, but there is no guarantee when the value of the release-operation becomes visible to the thread performing the acquire-operation.
Let's consider what would happen if we were to use acquire/release in this example:
void write_x() { x.store(true, std::memory_order_release); }
void write_y() { y.store(true, std::memory_order_release); }

void read_x_then_y() {
    while (!x.load(std::memory_order_acquire))
        ;
    if (y.load(std::memory_order_acquire))
        ++z;
}

void read_y_then_x() {
    while (!y.load(std::memory_order_acquire))
        ;
    if (x.load(std::memory_order_acquire))
        ++z;
}

int main() {
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join(); b.join(); c.join(); d.join();
    assert(z.load() != 0); // can actually happen!!
}
Since we have no guarantee about visibility, it could happen that thread c observes x == true and y == false, while at the same time thread d could observe y == true and x == false. So neither thread would increment z and the assertion would fire.
For more details about the C++ memory model I can recommend this paper which I have co-authored: Memory Models for C/C++ Programmers
You can use acquire/release when passing information from one thread to another - this is the most common situation, and it has no need for sequential consistency (see the sketch at the end of this answer).
In this example there are a bunch of threads: two threads perform the writes, while a third roughly tests whether x was ready before y and a fourth tests whether y was ready before x. Theoretically, one thread may observe that x was modified before y while another sees that y was modified before x. I'm not entirely sure how likely that is; it is an uncommon use case.
Edit: you can visualize the example: assume that each thread runs on a different PC and they communicate via a network, where each pair of PCs has a different ping to each other. Then it is easy to see how each PC can observe the two events x and y in a different order, so it is unclear which occurred first.
I am not sure on which architectures this effect may occur, but there are complex ones where two different processors are conjoined. Surely communication between the processors is slower than between cores of a single processor.
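As a concrete illustration of that common one-way handoff (a minimal sketch; the Payload and slot names are mine, not from the answer): the producer fills a payload with plain writes and then release-stores a pointer to it, and a consumer that acquire-loads a non-null pointer is guaranteed to see the fully constructed payload.
#include <atomic>
#include <cassert>
#include <string>
#include <thread>

struct Payload { int id; std::string text; };

std::atomic<Payload*> slot{nullptr};

void producer()
{
    Payload* p = new Payload{1, "hello"};     // plain, non-atomic writes
    slot.store(p, std::memory_order_release); // publish the pointer
}

void consumer()
{
    Payload* p;
    while (!(p = slot.load(std::memory_order_acquire)))
        ; // spin until published
    assert(p->id == 1 && p->text == "hello"); // payload fully visible
    delete p;
}

int main()
{
    std::thread c(consumer), t(producer);
    t.join(); c.join();
}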

Do locks in C++ 11 guarantee freshness of accessed data?

Usually, when std::atomic types are accessed concurrently by multiple threads, there's no guarantee a thread will read the "up to date" value when accessing them, and a thread may get a stale value from cache or any older value. The only way to get the up-to-date value is to use functions such as compare_exchange_XXX. (See questions here and here)
#include <atomic>
#include <mutex>

std::atomic<int> cancel_work = 0;
std::mutex mutex;

// Thread 1 executes this function
void thread1_func()
{
    cancel_work.store(1, <some memory order>);
}

// Thread 2 executes this function
void thread2_func()
{
    // No guarantee tmp will be 1, even when thread1_func is executed first
    int tmp = cancel_work.load(<some memory order>);
}
However my question is, what happens when using a mutex and lock instead? Do we have any guarantee of the freshness of shared data accessed?
For example, assuming both thread 1 and thread 2 are run concurrently and thread 1 obtains the lock first (executes first). Does it guarantee that thread 2 will see the modified value and not an old value?
Does it matter whether the shared data "cancel_work" is atomic or not in this case?
#include <atomic>
#include <mutex>
#include <thread>

int cancel_work = 0; // any difference if replaced with std::atomic<int> in this case?
std::mutex mutex;

// Thread 1 executes this function
void thread1_func()
{
    // Assuming Thread 1 enters the lock FIRST
    std::lock_guard<std::mutex> lock(mutex);
    cancel_work = 1;
}

// Thread 2 executes this function
void thread2_func()
{
    std::lock_guard<std::mutex> lock(mutex);
    int tmp = cancel_work; // Will tmp be 1 or 0?
}

int main()
{
    std::thread t1(thread1_func);
    std::thread t2(thread2_func);
    t1.join(); t2.join();
    return 0;
}
Yes, using the mutex/lock guarantees that thread2_func() will observe the modified value.
However, according to the std::atomic specification:
The synchronization is established only between the threads releasing
and acquiring the same atomic variable. Other threads can see
different order of memory accesses than either or both of the
synchronized threads.
So your code will work correctly using acquire/release logic, too.
#include <atomic>

std::atomic<int> cancel_work{0};

void thread1_func()
{
    cancel_work.store(1, std::memory_order_release);
}

void thread2_func()
{
    // tmp will be 1 when thread1_func is executed first
    int tmp = cancel_work.load(std::memory_order_acquire);
}
The C++ standard only constrains the observable behavior of the abstract machine in well-formed programs without undefined behavior anywhere during the abstract machine's execution.
It provides no guarantees about the mapping between the physical hardware actions the program executes and the abstract machine's behavior.
In your case, on the abstract machine there is no ordering between thread1's and thread2's execution. Even if the physical hardware were to schedule and run thread1 before thread2, that places zero constraints (in your simple example) on the output the program generates. The program's output is only constrained by what legal outputs the abstract machine could produce.
A C++ compiler can legally:

1. Eliminate your program completely as equivalent to return 0;

2. Prove that the read of cancel_work in thread2 is unsequenced relative to all modifications of cancel_work away from 0, and change it to a constant read of 0.

3. Actually run thread1 first, then run thread2, but prove it can treat the operations in thread2 as if they occurred before thread1 ran, so it doesn't bother forcing a cache-line refresh in thread2 and reads stale data from cancel_work.

What actually happens on the hardware does not impact what the program can legally do. And what the program can legally do in threading situations is restricted by the observable behavior of the abstract machine and by the behavior of synchronization primitives and their use in different threads.
For an actual happens-before relationship to occur, you need something like:
std::thread(thread1_func).join();
std::thread(thread2_func).join();
and now we do know that everything in thread1_func happens-before everything in thread2_func.
We can still rewrite your program as return 0; and make similar changes. But we now have a guarantee that the thread1_func code happens-before the thread2_func code.
Note that we can eliminate (1) above via:
std::lock_guard<std::mutex> lock(mutex);
int tmp = cancel_work; //Will tmp be 1 or 0?
std::cout << tmp;
and cause tmp to actually be printed.
The program can then be converted to one that prints 1 or 0 and has no threading at all. It could keep the threading, but change thread2_func to print a constant 0. Etc.
So we rewrite your program to look like this:
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::condition_variable cv;
bool writ = false;
int cancel_work = 0; // any difference if replaced with std::atomic<int> in this case?
std::mutex mutex;

// Thread 1 executes this function
void thread1_func()
{
    {
        std::lock_guard<std::mutex> lock(mutex);
        cancel_work = 1;
    }
    {
        std::lock_guard<std::mutex> lock(mutex);
        writ = true;
        cv.notify_all();
    }
}

// Thread 2 executes this function
void thread2_func()
{
    std::unique_lock<std::mutex> lock(mutex);
    cv.wait(lock, []{ return writ; });
    int tmp = cancel_work;
    std::cout << tmp; // will print 1
}

int main()
{
    std::thread t1(thread1_func);
    std::thread t2(thread2_func);
    t1.join(); t2.join();
    return 0;
}
and now thread2_func happens after thread1_func and all is good. The read is guaranteed to be 1.

C++ - Threads without coordinating mechanism like mutex_Lock

I attended an interview two days back. The interviewer was good at C++, but not at multithreading. He asked me to write code with two threads, where one thread prints 1,3,5,... and the other prints 2,4,6,..., but the output should be 1,2,3,4,5,.... So I gave the code below (pseudocode):
mutex_Lock LOCK;
int last = 2;
int last_Value = 0;

void function_Thread_1()
{
    while (1)
    {
        mutex_Lock(&LOCK);
        if (last == 2)
        {
            cout << ++last_Value << endl;
            last = 1;
        }
        mutex_Unlock(&LOCK);
    }
}

void function_Thread_2()
{
    while (1)
    {
        mutex_Lock(&LOCK);
        if (last == 1)
        {
            cout << ++last_Value << endl;
            last = 2;
        }
        mutex_Unlock(&LOCK);
    }
}
After this, he said: "these threads will work correctly even without those locks; those locks will reduce the efficiency". My point was that without the lock there will be a situation where one thread checks (last == 1 or 2) at the same time the other thread tries to change the value to 2 or 1. So my conclusion is that it will work without the lock, but that is not a correct/standard way. Now I want to know who is correct, and on what basis?
Without the lock, running the two functions concurrently would be undefined behaviour because there's a data race on the accesses to last and last_Value. Moreover (though not causing UB), the printing would be unpredictable.
With the lock, the program becomes essentially single-threaded and is probably slower than the naive single-threaded code. But that's just in the nature of the problem (i.e. producing a serialized sequence of events).
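For reference, here is that pseudocode translated into real C++11 with std::mutex (a sketch; I've added a stop condition after 6 so the program terminates, which the original pseudocode lacks):
#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;
int last = 2;
int last_value = 0;

void printer(int my_turn, int next_turn)
{
    for (;;) {
        std::lock_guard<std::mutex> guard(m);
        if (last_value >= 6) // stop once 1..6 have been printed
            return;
        if (last == my_turn) {
            std::cout << ++last_value << '\n';
            last = next_turn;
        }
    }
}

int main()
{
    std::thread t1(printer, 2, 1); // prints 1, 3, 5
    std::thread t2(printer, 1, 2); // prints 2, 4, 6
    t1.join();
    t2.join();
}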
I think the interviewer might have thought about using atomic variables.
Each instantiation and full specialization of the std::atomic template defines an atomic type. Objects of atomic types are the only C++ objects that are free from data races; that is, if one thread writes to an atomic object while another thread reads from it, the behavior is well-defined.
In addition, accesses to atomic objects may establish inter-thread synchronization and order non-atomic memory accesses as specified by std::memory_order.
[Source]
By this I mean the only thing you should change is to remove the locks and declare the last variable as std::atomic<int> last = 2; instead of int last = 2;
This should make it safe to access the last variable concurrently.
Out of curiosity I have edited your code a bit, and ran it on my Windows machine:
#include <iostream>
#include <atomic>
#include <thread>
#include <Windows.h>

std::atomic<int> last = 2;
std::atomic<int> last_Value = 0;
std::atomic<bool> running = true;

void function_Thread_1()
{
    while (running)
    {
        if (last == 2)
        {
            last_Value = last_Value + 1;
            std::cout << last_Value << std::endl;
            last = 1;
        }
    }
}

void function_Thread_2()
{
    while (running)
    {
        if (last == 1)
        {
            last_Value = last_Value + 1;
            std::cout << last_Value << std::endl;
            last = 2;
        }
    }
}

int main()
{
    std::thread a(function_Thread_1);
    std::thread b(function_Thread_2);
    while (last_Value != 6) {} // we want to print 1 to 6
    running = false;           // inform threads we are about to stop
    a.join();
    b.join();                  // join
    while (!GetAsyncKeyState('Q')) {} // wait for 'Q' press
    return 0;
}
and the output is always:
1
2
3
4
5
6
Ideone refuses to run this code (compilation errors).
Edit: But here is a working Linux version :) (thanks to soon)
The interviewer doesn't know what he is talking about. Without the locks you get races on both last and last_Value. The compiler could, for example, reorder the assignment to last before the print and increment of last_Value, which could lead to the other thread executing on stale data. Furthermore, you could get interleaved output, meaning things like two numbers not being separated by a line break.
Another thing that could go wrong is that the compiler might decide not to reload last and (less importantly) last_Value on each iteration, since they can't (safely) change between those iterations anyway (data races are illegal by the C++11 standard and aren't acknowledged in previous standards). This means that the code suggested by the interviewer actually has a good chance of producing an infinite loop that does absolutely nothing.
While it is possible to make that code correct without mutexes, it absolutely needs atomic operations with appropriate ordering constraints (release semantics on the assignment to last and acquire semantics on the load of last inside the if statement); see the sketch at the end of this answer.
Of course your solution does lower efficiency by effectively serializing the whole execution. However, since the runtime is almost completely spent inside the stream output operation, which is almost certainly internally synchronized by locks, your solution doesn't lower the efficiency any more than it already is. Waiting on the lock in your code might actually be faster than busy-waiting for it, depending on the available resources (the non-locking version using atomics would absolutely tank when executed on a single-core machine).
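Here is a minimal sketch of that lock-free variant (my own illustration, not code from the thread), with a fixed count of three numbers per thread so it terminates. The release store and acquire load on last also order the accesses to the non-atomic last_Value, so no mutex is needed:
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> last{2};
int last_Value = 0; // non-atomic: ordered by the release/acquire handoff on last

void printer(int my_turn, int next_turn, int count)
{
    for (int printed = 0; printed < count; ) {
        if (last.load(std::memory_order_acquire) == my_turn) {
            // The acquire load synchronized with the other thread's release
            // store, so this access to last_Value is race-free.
            std::cout << ++last_Value << '\n';
            ++printed;
            last.store(next_turn, std::memory_order_release);
        }
    }
}

int main()
{
    std::thread t1(printer, 2, 1, 3); // prints 1, 3, 5
    std::thread t2(printer, 1, 2, 3); // prints 2, 4, 6
    t1.join();
    t2.join();
}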

Understanding c++11 memory fences

I'm trying to understand memory fences in c++11, I know there are better ways to do this, atomic variables and so on, but wondered if this usage was correct. I realize that this program doesn't do anything useful, I just wanted to make sure that the usage of the fence functions did what I thought they did.
Basically, that the release ensures that any changes made in this thread before the fence are visible to other threads after the fence, and that in the second thread any changes to the variables are visible immediately after the fence?
Is my understanding correct? Or have I missed the point entirely?
#include <iostream>
#include <atomic>
#include <thread>

int a;

void func1()
{
    for (int i = 0; i < 1000000; ++i)
    {
        a = i;
        // Ensure that changes to a to this point are visible to other threads
        atomic_thread_fence(std::memory_order_release);
    }
}

void func2()
{
    for (int i = 0; i < 1000000; ++i)
    {
        // Ensure that this thread's view of a is up to date
        atomic_thread_fence(std::memory_order_acquire);
        std::cout << a;
    }
}

int main()
{
    std::thread t1(func1);
    std::thread t2(func2);
    t1.join(); t2.join();
}
Your usage does not actually ensure the things you mention in your comments. That is, your usage of fences does not ensure that your assignments to a are visible to other threads or that the value you read from a is 'up to date.' This is because, although you seem to have the basic idea of where fences should be used, your code does not actually meet the exact requirements for those fences to "synchronize".
Here's a different example that I think demonstrates correct usage better.
#include <iostream>
#include <atomic>
#include <thread>

std::atomic<bool> flag(false);
int a;

void func1()
{
    a = 100;
    atomic_thread_fence(std::memory_order_release);
    flag.store(true, std::memory_order_relaxed);
}

void func2()
{
    while (!flag.load(std::memory_order_relaxed))
        ;
    atomic_thread_fence(std::memory_order_acquire);
    std::cout << a << '\n'; // guaranteed to print 100
}

int main()
{
    std::thread t1(func1);
    std::thread t2(func2);
    t1.join(); t2.join();
}
The load and store on the atomic flag do not synchronize, because they both use the relaxed memory ordering. Without the fences this code would be a data race, because we're performing conflicting operations on a non-atomic object in different threads, and without the fences and the synchronization they provide there would be no happens-before relationship between the conflicting operations on a.
However with the fences we do get synchronization, because we've guaranteed that thread 2 will read the flag written by thread 1 (because we loop until we see that value), and since the atomic write is sequenced after the release fence and the atomic read is sequenced before the acquire fence, the fences synchronize. (See § 29.8/2 for the specific requirements.)
This synchronization means anything that happens-before the release fence happens-before anything that happens-after the acquire fence. Therefore the non-atomic write to a happens-before the non-atomic read of a.
Things get trickier when you're writing a variable in a loop, because you might establish a happens-before relation for some particular iteration, but not other iterations, causing a data race.
std::atomic<int> f(0);
int a;

void func1()
{
    for (int i = 0; i < 1000000; ++i) {
        a = i;
        atomic_thread_fence(std::memory_order_release);
        f.store(i, std::memory_order_relaxed);
    }
}

void func2()
{
    int prev_val = 0;
    while (prev_val < 999999) { // 999999 is the last value func1 stores
        while (true) {
            int new_val = f.load(std::memory_order_relaxed);
            if (prev_val < new_val) {
                prev_val = new_val;
                break;
            }
        }
        atomic_thread_fence(std::memory_order_acquire);
        std::cout << a << '\n';
    }
}
This code still causes the fences to synchronize but does not eliminate data races. For example if f.load() happens to return 10 then we know that a=1,a=2, ... a=10 have all happened-before that particular cout<<a, but we don't know that cout<<a happens-before a=11. Those are conflicting operations on different threads with no happens-before relation; a data race.
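One way to close that remaining race (a sketch of my own, not part of the original answer; the ack counter is a name I've introduced) is to make the handshake two-way, so the writer never overwrites a until the reader has acknowledged the previous value:
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> f(0);
std::atomic<int> ack(0); // reader's acknowledgement counter (hypothetical)
int a;

void func1()
{
    for (int i = 1; i <= 1000000; ++i) {
        a = i;
        atomic_thread_fence(std::memory_order_release);
        f.store(i, std::memory_order_relaxed);
        // Wait until the reader has consumed this value, so the next
        // write to a cannot race with the read.
        while (ack.load(std::memory_order_acquire) < i)
            ;
    }
}

void func2()
{
    for (int i = 1; i <= 1000000; ++i) {
        while (f.load(std::memory_order_relaxed) < i)
            ;
        atomic_thread_fence(std::memory_order_acquire);
        std::cout << a << '\n'; // a == i here, race-free
        ack.store(i, std::memory_order_release);
    }
}

int main()
{
    std::thread t1(func1);
    std::thread t2(func2);
    t1.join(); t2.join();
}
This runs in lock-step and is slow, but now every write to a happens-before the read that observes it, and that read happens-before the next write.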
Your usage is correct, but insufficient to guarantee anything useful.
For example, the compiler is free to internally implement a = i; like this if it wants to:
while (a != i)
{
    ++a;
    atomic_thread_fence(std::memory_order_release);
}
So the other thread may see any values at all.
Of course, the compiler would never implement a simple assignment like that. However, there are cases where similarly perplexing behavior is actually an optimization, so it's a very bad idea to rely on ordinary code being implemented internally in any particular way. This is why we have things like atomic operations, and why fences only produce guaranteed results when used with such operations.