Understanding c++11 memory fences - c++

I'm trying to understand memory fences in c++11, I know there are better ways to do this, atomic variables and so on, but wondered if this usage was correct. I realize that this program doesn't do anything useful, I just wanted to make sure that the usage of the fence functions did what I thought they did.
Basically that the release ensures that any changes made in this thread before the fence are visible to other threads after the fence, and that in the second thread that any changes to the variables are visible in the thread immediately after the fence?
Is my understanding correct? Or have I missed the point entirely?
#include <iostream>
#include <atomic>
#include <thread>
int a;
void func1()
{
for(int i = 0; i < 1000000; ++i)
{
a = i;
// Ensure that changes to a to this point are visible to other threads
atomic_thread_fence(std::memory_order_release);
}
}
void func2()
{
for(int i = 0; i < 1000000; ++i)
{
// Ensure that this thread's view of a is up to date
atomic_thread_fence(std::memory_order_acquire);
std::cout << a;
}
}
int main()
{
std::thread t1 (func1);
std::thread t2 (func2);
t1.join(); t2.join();
}

Your usage does not actually ensure the things you mention in your comments. That is, your usage of fences does not ensure that your assignments to a are visible to other threads or that the value you read from a is 'up to date.' This is because, although you seem to have the basic idea of where fences should be used, your code does not actually meet the exact requirements for those fences to "synchronize".
Here's a different example that I think demonstrates correct usage better.
#include <iostream>
#include <atomic>
#include <thread>
std::atomic<bool> flag(false);
int a;
void func1()
{
a = 100;
atomic_thread_fence(std::memory_order_release);
flag.store(true, std::memory_order_relaxed);
}
void func2()
{
while(!flag.load(std::memory_order_relaxed))
;
atomic_thread_fence(std::memory_order_acquire);
std::cout << a << '\n'; // guaranteed to print 100
}
int main()
{
std::thread t1 (func1);
std::thread t2 (func2);
t1.join(); t2.join();
}
The load and store on the atomic flag do not synchronize, because they both use the relaxed memory ordering. Without the fences this code would be a data race, because we're performing conflicting operations a non-atomic object in different threads, and without the fences and the synchronization they provide there would be no happens-before relationship between the conflicting operations on a.
However with the fences we do get synchronization because we've guaranteed that thread 2 will read the flag written by thread 1 (because we loop until we see that value), and since the atomic write happened after the release fence and the atomic read happens-before the acquire fence, the fences synchronize. (see § 29.8/2 for the specific requirements.)
This synchronization means anything that happens-before the release fence happens-before anything that happens-after the acquire fence. Therefore the non-atomic write to a happens-before the non-atomic read of a.
Things get trickier when you're writing a variable in a loop, because you might establish a happens-before relation for some particular iteration, but not other iterations, causing a data race.
std::atomic<int> f(0);
int a;
void func1()
{
for (int i = 0; i<1000000; ++i) {
a = i;
atomic_thread_fence(std::memory_order_release);
f.store(i, std::memory_order_relaxed);
}
}
void func2()
{
int prev_value = 0;
while (prev_value < 1000000) {
while (true) {
int new_val = f.load(std::memory_order_relaxed);
if (prev_val < new_val) {
prev_val = new_val;
break;
}
}
atomic_thread_fence(std::memory_order_acquire);
std::cout << a << '\n';
}
}
This code still causes the fences to synchronize but does not eliminate data races. For example if f.load() happens to return 10 then we know that a=1,a=2, ... a=10 have all happened-before that particular cout<<a, but we don't know that cout<<a happens-before a=11. Those are conflicting operations on different threads with no happens-before relation; a data race.

Your usage is correct, but insufficient to guarantee anything useful.
For example, the compiler is free to internally implement a = i; like this if it wants to:
while(a != i)
{
++a;
atomic_thread_fence(std::memory_order_release);
}
So the other thread may see any values at all.
Of course, the compiler would never implement a simple assignment like that. However, there are cases where similarly perplexing behavior is actually an optimization, so it's a very bad idea to rely on ordinary code being implemented internally in any particular way. This is why we have things like atomic operations and fences only produce guaranteed results when used with such operations.

Related

How to implement simple custom lock using acquire-release memory order model?

I'm trying to understand acquire-release memory order through implementing my custom lock.
#include <atomic>
#include <vector>
#include <thread>
#include <iostream>
class my_lock {
static std::atomic<bool> flag;
public:
void lock() {
bool expected = false;
while (!flag.compare_exchange_strong(expected, true, std::memory_order_acq_rel))
expected = false;
}
void unlock() {
flag.store(false, std::memory_order_release);
}
};
std::atomic<bool> my_lock::flag(false);
static int num0 = 0;
static int num1 = 0;
my_lock lk;
void increase() {
for(int i = 0; i < 100000; ++i) {
lk.lock();
num0++;
num1++;
lk.unlock();
}
}
void read() {
for(int i = 0; i < 100000; ++i) {
lk.lock();
if(num0 > num1) {
std::cout << "num0:" << num0 << " > " << "num1:" << num1 << std::endl;
}
lk.unlock();
}
}
int main() {
std::thread t1(increase);
std::thread t2(read);
t1.join();
t2.join();
std::cout << "finished! num0:" << num0 << ", num1:"<< num1 << std::endl;
}
Question 1: Am I correct to use acquire-release memory order ?
Below are paragraph from C++ Concurrency In Action
Despite the potentially non-intuitive outcomes, anyone who’s used locks has had to deal with the same ordering issues: locking a mutex is an acquire operation, and unlocking the mutex is a release operation. With mutexes, you learn that you must ensure that the same mutex is locked when you read a value as was locked when you wrote it, and the same applies here; your acquire and release operations have to be on the same variable to ensure an ordering. If data is protected with a mutex, the exclusive nature of the lock means that the result is indistinguishable from what it would have been had the lock and unlock been sequentially consistent operations. Similarly, if you use acquire and release orderings on atomic variables to build a simple lock, then from the point of view of code that uses the lock, the behavior will appear sequentially consistent, even though the internal operations are not.
This paragraph says that "the result is indistinguishable from ..... sequentially consistent operation".
Question 2: Why the result is indistinguishable? If we use lock, the result should be distinguishable from my understanding.
Edit :
I add one more question.
Below is std::atomic_flag example in C++ concurrency in action.
class spinlock_mutex
{
std::atomic_flag flag;
public:
spinlock_mutex():
flag(ATOMIC_FLAG_INIT)
{}
void lock()
{
while(flag.test_and_set(std::memory_order_acquire));
}
void unlock()
{
flag.clear(std::memory_order_release);
}
}
Question 3: Why does this code doesn't use std::memory_order_acq_rel? flag.test_and_set is RWM operation, so I think it should be used with std::memory_order_acq_rel like my first example.
The acquire/release is strong enough to confine the instructions within the critical sections. And it provides sufficient synchronization such that a happens before relation is established between a release of a lock and subsequent acquire of the same lock.
The lock will give you some sequential order on the lock acquire/release operations; just as with sequential consistency.
Why are you using compare_exchange_strong? You are already in a loop. You can use compare_exchange_weak.

C++ memory_order_acquire/release questions

I recently learn about c++ six memory orders, I felt very confusing about memory_order_acquire and memory_order_release, here is an example from cpp:
#include <thread>
#include <atomic>
#include <cassert>
std::atomic<bool> x = {false};
std::atomic<bool> y = {false};
std::atomic<int> z = {0};
void write_x() { x.store(true, std::memory_order_seq_cst); }
void write_y() { y.store(true, std::memory_order_seq_cst); }
void read_x_then_y() {
while (!x.load(std::memory_order_seq_cst));
if (y.load(std::memory_order_seq_cst))
++z;
}
void read_y_then_x() {
while (!y.load(std::memory_order_seq_cst));
if (x.load(std::memory_order_seq_cst))
++z;
}
int main() {
std::thread a(write_x);
std::thread b(write_y);
std::thread c(read_x_then_y);
std::thread d(read_y_then_x);
a.join(); b.join(); c.join(); d.join();
assert(z.load() != 0); // will never happen
}
In the cpp reference page, it says:
This example demonstrates a situation where sequential ordering is necessary.
Any other ordering may trigger the assert because it would be possible
for the threads c and d to observe changes to the atomics x and y in
opposite order.
So my question is why memory_order_acquire and memory_order_release can not be used here? And what semantics does memory_order_acquire and memory_order_release provide?
some references:
https://en.cppreference.com/w/cpp/atomic/memory_order
https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync
Sequential consistency provides a single total order of all sequentially consistent operations. So if you have a sequentially consistent store in thread A, and a sequentially consistent load in thread B, and the store is ordered before the load (in said single total order), then B observes the value stored by A. So basically sequential consistency guarantees that the store is "immediately visible" to other threads. A release store does not provide this guarantee.
As Peter Cordes pointed out correctly, the term "immediately visible" is rather imprecise. The "visibility" stems from the fact that all seq-cst operations are totally ordered, and all threads observe that order. Since the store and the load are totally ordered, the value of a store becomes visible before a subsequent load (in the single total order) is executed.
There exists no such total order between acquire/release operations in different threads, so there is not visibility guarantee. The operations are only ordered once an acquire-operations observes the value from a release-operation, but there is no guarantee when the value of the release-operation becomes visible to the thread performing the acquire-operation.
Let's consider what would happen if we were to use acquire/release in this example:
void write_x() { x.store(true, std::memory_order_release); }
void write_y() { y.store(true, std::memory_order_release); }
void read_x_then_y() {
while (!x.load(std::memory_order_acquire));
if (y.load(std::memory_order_acquire))
++z;
}
void read_y_then_x() {
while (!y.load(std::memory_order_acquire));
if (x.load(std::memory_order_acquire))
++z;
}
int main() {
std::thread a(write_x);
std::thread b(write_y);
std::thread c(read_x_then_y);
std::thread d(read_y_then_x);
a.join(); b.join(); c.join(); d.join();
assert(z.load() != 0); // can actually happen!!
}
Since we have no guarantee about visibility, it could happen that thread c observes x == true and y == false, while at the same time thread d could observe y == true and x == false. So neither thread would increment z and the assertion would fire.
For more details about the C++ memory model I can recommend this paper which I have co-authored: Memory Models for C/C++ Programmers
You can use aquire/release when passing information from one thread to another - this is the most common situation. No need for sequential requirements on this one.
In this example there are a bunch of threads. Two threads make write operation while third roughly tests whether x was ready before y and fourth tests whether y was ready before x. Theoretically one thread may observe that x was modified before y while another sees that y was modified before x. Not entirely sure how likely it is. This is an uncommon usecase.
Edit: you can visualize the example: assume that each threads is run on a different PC and they communicate via a network. Each pair of PCs has a different ping to each other. Here it is easy to make an example where it is unclear which event occurred first x or y as each PC will see the events occur in different order.
I am not sure on sure on which architectures this effect may occur but there are complex ones where two different processors are conjoined. Surely communication between the processors is slower than between cores of each processor.

Do locks in C++ 11 guarantee freshness of accessed data?

Usually, when using std::atomic types accessed concurrently by multiple threads, there's no guarantee a thread will read the "up to date" value when accessing them, and a thread may get a stale value from cache or any older value. The only way to get the up to date value are functions such as compare_exchange_XXX. (See questions here and here)
#include <atomic>
std::atomic<int> cancel_work = 0;
std::mutex mutex;
//Thread 1 executes this function
void thread1_func()
{
cancel_work.store(1, <some memory order>);
}
// Thread 2 executes this function
void thread2_func()
{
//No guarantee tmp will be 1, even when thread1_func is executed first
int tmp = cancel_work.load(<some memory order>);
}
However my question is, what happens when using a mutex and lock instead? Do we have any guarantee of the freshness of shared data accessed?
For example, assuming both thread 1 and thread 2 are run concurrently and thread 1 obtains the lock first (executes first). Does it guarantee that thread 2 will see the modified value and not an old value?
Does it matter whether the shared data "cancel_work" is atomic or not in this case?
#include <atomic>
int cancel_work = 0; //any difference if replaced with std::atomic<int> in this case?
std::mutex mutex;
// Thread 1 executes this function
void thread1_func()
{
//Assuming Thread 1 enters lock FIRST
std::lock_guard<std::mutex> lock(mutex);
cancel_work = 1;
}
// Thread 2 executes this function
void thread2_func()
{
std::lock_guard<std::mutex> lock(mutex);
int tmp = cancel_work; //Will tmp be 1 or 0?
}
int main()
{
std::thread t1(thread1_func);
std::thread t2(thread2_func);
t1.join(); t2.join();
return 0;
}
Yes, the using of the mutex/lock guarantees that thread2_func() will obtain a modified value.
However, according to the std::atomic specification:
The synchronization is established only between the threads releasing
and acquiring the same atomic variable. Other threads can see
different order of memory accesses than either or both of the
synchronized threads.
So your code will work correctly using acquire/release logic, too.
#include <atomic>
std::atomic<int> cancel_work = 0;
void thread1_func()
{
cancel_work.store(1, std::memory_order_release);
}
void thread2_func()
{
// tmp will be 1, when thread1_func is executed first
int tmp = cancel_work.load(std::memory_order_acquire);
}
The C++ standard only constrains the observable behavior of the abstract machine in well formed programs without undefined behavior anywhere during the abstract machine's execution.
It provides no guarantees about mapping between the physical hardware actions the program executes and behavior.
In your cases, on the abstract machine there is no ordering between thread1 and thread2's execution. Even if the physical hardware where to schedule and run thread1 before thread2, that places zero constraints (in your simple example) on the output the program generates. The programs' output is only contrained by what legal outputs the abstract machine could produce.
A C++ compiler can legally:
Eliminate your program completely as equivalent to return 0;
Prove that the read of cancel_work in thread2 is unsequenced relative to all modification of cancel_work away from 0, and change it to a constant read of 0.
Actually run thread1 first then run thread2, but prove it can treat the operations in thread2 as-if they occurred before thread1 ran, so don't bother forcing a cache line refresh in thread2 and reading stale data from cancel_work.
What actually happens on the hardware does not impact what the program can legally do. And what the program can legally do is in threading sitations is restricted by observable behavior of the abstract machine, and on the behavior of synchronization primitives and their use in different threads.
For an actual happens before relationship to occur, you need something like:
std::thread(thread1_func).join();
std::thread(thread2_func).join();
and now we do know that everything in thread1_func happens before thread2_func.
We can still rewrite your program as return 0; and similar changes. But we now have a guarantee that thread1_func happens before thread2_func code does.
Note that we can eliminate (1) above via:
std::lock_guard<std::mutex> lock(mutex);
int tmp = cancel_work; //Will tmp be 1 or 0?
std::cout << tmp;
and cause tmp to actually be printed.
The program can then be converted to one that prints 1 or 0 and has no threading at all. It could keep the threading, but change thread2_func to print a constant 0. Etc.
So we rewrite your program to look like this:
std::condition_variable cv;
bool writ = false;
int cancel_work = 0; //any difference if replaced with std::atomic<int> in this case?
std::mutex mutex;
// Thread 1 executes this function
void thread1_func()
{
{
std::lock_guard<std::mutex> lock(mutex);
cancel_work = 1;
}
{
std::lock_guard<std::mutex> lock(mutex);
writ = true;
cv.notify_all();
}
}
// Thread 2 executes this function
void thread2_func()
{
std::unique_lock<std::mutex> lock(mutex);
cv.wait(lock, []{ return writ; } );
int tmp = cancel_work;
std::cout << tmp; // will print 1
}
int main()
{
std::thread t1(thread1_func);
std::thread t2(thread2_func);
t1.join(); t2.join();
return 0;
}
and now thread2_func happens after thread1_func and all is good. The read is guaranteed to be 1.

Can I use no thread synchronization here?

I am researching mutexes.
I come up with this example that seems to work without any synchronization.
#include <cstdint>
#include <thread>
#include <iostream>
constexpr size_t COUNT = 10000000;
int g_x = 0;
void p1(){
for(size_t i = 0; i < COUNT; ++i){
++g_x;
}
}
void p2(){
int a = 0;
for(size_t i = 0; i < COUNT; ++i){
if (a > g_x){
std::cout << "Problem detected" << '\n';
}
a = g_x;
}
}
int main(){
std::thread t1{ p1 };
std::thread t2{ p2 };
t1.join();
t2.join();
std::cout << g_x << '\n';
}
My assumptions are following:
Thread 1 change the value of g_x, but it is the only thread that change the value, so theoretically this suppose to be OK.
Thread 2 reads the value of g_x. Reads suppose to be atomic on x86 and ARM. So there must be no problem there too. I have example with several read threads and it works OK too.
With other words, write is not shared and reads are atomic.
Are the assumptions correct?
There's certainly a data race here: g_x is not an std::atomic; it is written to by one thread, and read from by another. So the results are undefined.
Note that the CPU memory model is only part of the deal. The compiler might do all sorts of optimizations (using registers, reordering etc.) if you don't declare your shared variables properly.
As for mutexes, you do not need one here. Declaring g_x as atomic should remove the UB and guarantee proper communication between the threads. Btw, the for in p2 can probably be optimized out even if you're using atomics, but I assume this is just a reduced code and not the real thing.

Many reads/one write in atomic variable

Could you help me please.
Suppose I have p - 1 read threads and one write thread. They all read and write in one atomic int variable. Could it be that if all reads and write occur simultaneously the write operation will wait p - 1 time? I have doubts because when atomic operation happens there is some strange lock(in assembler) and I afraid that it locks memory(where variable is). So it could happen that write operation will wait for p-1 reads. Could it happen?
Here is some simple code:
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>
std::atomic<int> val;
void writer()
{
val.store(7);
}
void read()
{
int tmp = val.load();
while (tmp == 0)
{
std::cout << std::this_thread::get_id() << ": wait" << std::endl;
tmp = val.load();
}
std::cout << std::this_thread::get_id() << " Operation: " << tmp * tmp << std::endl;
}
int main()
{
val.store(0);
std::vector<std::thread> v;
for (int i = 0; i < 1; ++i)
v.push_back(std::thread(read));
std::this_thread::sleep_for(std::chrono::milliseconds(77));
writer();
std::for_each(v.begin(), v.end(), std::mem_fn(&std::thread::join));
return 0;
}
Thank you!
Processor's instruction, which locks memory bus(has LOCK prefix), does not use locking in usual, high-level sence. It makes threads(caller one and, probably, some concurrent threads which access same or near memory blocks) a bit slower.
The upper limit of this bit is only depends from machine and its architecture.
Normal locks also make threads slower, but amount of this slowerness highly depends from lock contention, locking implementation properties(e.g., fairness), and code under lock protection. You shouldn't bother about locked memory access except because of perfomance reason.
Actually, LOCK prefix doesn't need for atomic loads/stores. I guess, it is a compiler optimization, which provides sequential consistent memory order. This order is enforced by .store() and .load() atomic's methods by default, but it is unnecessary in your example. The mostly used pattern is:
use relaxed memory order for initialization:
val.store(0, std::memory_order_relaxed);
use acquire memory order for read value:
tmp = val.load(std::memory_order_acquire);
use release memory order for write(change) value:
val.store(7, std::memory_order_release);
This will prevent compiler from using instructions with LOCK prefix.