Memory model: how do load-acquire semantics actually work? - C++

This comes from a very nice paper and article about memory reordering.
Q1: Am I right that cache coherence, the store buffer and the invalidation queue are the root causes of memory reordering?
Store-release is quite understandable: all earlier loads and stores have to complete before the flag is set to true.
As for load-acquire, the typical use of an atomic load is waiting for a flag. Suppose we have 2 threads:
int x = 0;
std::atomic<bool> ready_flag = false;
// thread-1
if(ready_flag.load(std::memory_order_relaxed))
{
// (1)
// load x here
}
// (2)
// load x here
// thread-2
x = 100;
ready_flag.store(true, std::memory_order_release);
EDIT: in thread-1 it should be a while loop, but I copied the logic from the article above. So assume the memory reordering happens at just the wrong moment.
Q2: Because (1) and (2) depend on the if condition, the CPU has to wait for ready_flag. Does that mean write-release alone is enough? How can memory reordering happen in this context?
Q3: Obviously we do have load-acquire, so I guess memory reordering is possible. Where, then, should we place the fence: at (1) or (2)?

Accessing an atomic variable is not a mutex operation; it merely accesses the stored value atomically, with no chance for any CPU operation to interrupt the access, so no data races can occur with regard to accessing that value (it can also issue barriers with regard to other accesses, which is what the memory orders provide). But that's it; it doesn't wait for any particular value to appear in the atomic variable.
As such, your if statement will read whatever value happens to be there at the time. If you want to guard access to x until the other statement has written to it and signaled the atomic, you must:
Not allow any code to read from x until the atomic flag has returned the value true. Simply testing the value once won't do that; you must loop over repeated accesses until it is true. Any other attempt to read from x results in a data race and is therefore undefined behavior.
Whenever you access the flag, you must do so in a way that tells the system that values written by the thread setting that flag should be visible to subsequent operations that see the set value. That requires a proper memory order, one which must be at least memory_order_acquire.
To be technical, the read from the flag itself doesn't have to do the acquire. You could perform an acquire operation after having read the proper value from the flag. But you need to have an acquire-equivalent operation happen before reading x.
The writing statement must set the flag using a releasing memory order, one at least as powerful as memory_order_release (a sketch putting these rules together follows below).
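A minimal sketch of those rules applied to the question's code, assuming a busy-wait loop is acceptable in the reader (the spin is only for illustration; real code might sleep or use a condition variable):

#include <atomic>
#include <thread>

int x = 0;
std::atomic<bool> ready_flag{false};

void producer() {                                          // thread-2
    x = 100;                                               // plain (non-atomic) store
    ready_flag.store(true, std::memory_order_release);     // publish: everything above becomes visible to an acquire reader
}

void consumer() {                                          // thread-1
    while (!ready_flag.load(std::memory_order_acquire)) {  // loop until the flag is seen as true
        // spin
    }
    int local = x;   // safe: the acquire load synchronized-with the release store
    (void)local;
}

int main() {
    std::thread t2(producer), t1(consumer);
    t1.join();
    t2.join();
}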

Because (1) and (2) depends on if condition, CPU have to wait for ready_flag
There are 2 showstopper flaws in that reasoning:
Branch prediction + speculative execution is a real thing in real CPUs. Control dependencies behave differently from data dependencies. Speculative execution breaks control dependencies.
In most (but not all) real CPUs, data dependencies do work like C++ memory_order_consume. A typical use-case is loading a pointer and then dereferencing it. That's still not safe in C++'s very weak memory model, but will happen to compile to asm that works for most ISAs other than DEC Alpha. Alpha can (in practice on some hardware) even manage to violate causality and load a stale value when dereferencing a just-loaded pointer, even if the stores were correctly ordered.
Compilers can break control and even data dependencies. C++ source logic doesn't always translate directly to asm. In this case a compiler could emit asm that works like this:
tmp = load(x);          // compile-time reordering before the relaxed load
if (load(ready_flag))
    actually use tmp;
It's data-race UB in C++ to read x while it might still be being written, but for most specific ISAs there's no problem with that. You just have to avoid actually using any load results that might be bogus.
This might not be a useful optimization for most ISAs but nothing rules it out. Hiding load latency on in-order pipelines by doing the load earlier might actually be useful sometimes (if the variable wasn't being written by another thread, and the compiler might guess that wasn't happening because there's no acquire load).
By far your best bet is to use ready_flag.load(mo_acquire).
A separate problem is that you have commented-out code that reads x after the if(), which will run even if the load didn't see the data ready. As Nicol explained in an answer, this means data-race UB is possible because you might be reading x while the producer is writing it.
Perhaps you wanted to write a spin-wait loop like while(!ready_flag){ _mm_pause(); }? Generally be careful of wasting huge amounts of CPU time spinning; if it might be a long time, use a library-supported thing like maybe a condition variable that gives you efficient fallback to OS-supported sleep/wakeup (e.g. Linux futex) after spinning for a short time.
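A sketch of such a spin-wait, assuming an x86 target where the _mm_pause() intrinsic from <immintrin.h> is available (on other ISAs substitute the equivalent spin hint, or std::this_thread::yield()):

#include <atomic>
#include <immintrin.h>   // _mm_pause(), x86 only

std::atomic<bool> ready_flag{false};

void wait_until_ready() {
    while (!ready_flag.load(std::memory_order_acquire)) {
        _mm_pause();     // hint to the CPU that we're spinning; saves power / SMT resources
    }
    // data published before the matching release store is now visible here
}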
If you did want a manual barrier separate from the load, it would be
if (ready_flag.load(mo_relaxed)) {
    atomic_thread_fence(mo_acquire);
    int tmp = x;   // now this is safe
}
// atomic_thread_fence(mo_acquire);  // still wouldn't make it safe to read x here,
// because this code runs even when ready_flag == false
Using if(ready_flag.load(mo_acquire)) would lead to an unconditional barrier before branching on the ready_flag load, when compiling for any ISA where an acquire load isn't available as a single instruction. (On x86 all loads are acquire loads; on AArch64, ldar does an acquire load; 32-bit ARM needs a plain load + dmb ish.)

The C++ standard doesn't specify the code generated by any particular construct; only correct combinations of thread communication tools produce a guaranteed result.
You don't get guarantees from the CPU in C++, because C++ is not a kind of (macro) assembly, not even a "high-level assembly", at least not when the objects involved don't all have volatile type.
Atomic objects are communication tools for exchanging data between threads. The correct use, for correct visibility of memory operations, is a store operation with (at least) release followed by a load with (at least) acquire; or the same with an RMW operation in between; or with the store (resp. the load) replaced by an RMW with (at least) release (resp. acquire); or any of these variants with a relaxed operation plus a separate fence.
In all cases:
the thread "publishing" the "done" flag must use a memory ordering at least release (that is: release, release+acquire or sequential consistency),
and the "subscribing" thread, the one acting on the flag must use at least acquire (that is: acquire, release+acquire or sequential consistency).
In practice with separately compiled code other modes might work, depending on the CPU.

Related

ARM64 64 bit load/store data race

According to this, a 64-bit load/store is considered to be an atomic access on arm64. Given this, is the following program still considered to have a data race (and thus can exhibit UB) when compiled for arm64 (ignore ordering with respect to other memory accesses)?
uint64_t x;
// Thread 1
void f()
{
uint64_t a = x;
}
// Thread 2
void g()
{
x = 1;
}
If instead I switch this to using
std::atomic<uint64_t> x{};
// Thread 1
void f()
{
uint64_t a = x.load(std::memory_order_relaxed);
}
// Thread 2
void g()
{
x.store(1, std::memory_order_relaxed);
}
Is the second program considered data race free?
On arm64, it looks like the compiler ends up generating the same instruction for a normal 64 bit load/store and a load/store of an atomic with memory_order_relaxed, so what's the difference?
std::atomic solves 4 problems.
One is that the load/store itself is atomic, meaning you don't get loads and stores intermixed so that for example you load 32 bits from before a store and the other 32 bits from after a store. Normally everything up to register size is naturally atomic in that sense on the CPU itself. Things might break with unaligned access, potentially only when the access crosses a cache line. In std::atomic<T> implementations you will see the use of locks when the size of T exceeds the size the CPU reads/writes atomically on its own.
The second thing std::atomic does is synchronize access between threads. Just because one thread writes data to a variable doesn't mean another thread sees that data appear instantly. The writing CPU puts the data into its store buffer, hoping it just gets overwritten again or that adjacent memory gets written so the two writes can be combined. After a while the data goes to L1 cache, where it can stay even longer, then L2 and L3. Depending on the architecture, cache may or may not be shared between CPU cores, and the cores might not synchronize automatically. So when you want to access the same memory address from multiple cores you have to tell the CPU to synchronize the access with other cores.
The third thing has to do with modern CPUs doing out-of-order and speculative execution. That means even if the code checks a variable and then reads a second variable, the CPU might read the second variable first. If the first variable acts as a semaphore signaling that the second variable is ready to be read, then this can fail because the read happens before the data is ready. std::atomic adds barriers preventing the CPU from doing these reorderings, so reads and writes happen in a specific order in the hardware.
The fourth thing is much the same but for the compiler. std::atomic prevents the compiler from reordering instructions across it. Or from optimizing multiple reads or writes into just one.
All of this std::atomic does automatically for you if you just use it without specifying any memory order. The default memory order is the strongest one, memory_order_seq_cst.
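On the first point, a small sketch of where an implementation falls back to a lock, using is_lock_free() (the 32-byte struct is just an assumed example; results vary by target, and with gcc the non-lock-free case may need linking against -latomic):

#include <atomic>
#include <cstdint>
#include <cstdio>

struct Big { std::uint64_t v[4]; };   // 32 bytes: wider than the CPU can load/store atomically

int main() {
    std::atomic<std::uint64_t> small{0};
    std::atomic<Big> big{};
    std::printf("uint64_t lock-free:       %d\n", small.is_lock_free()); // typically 1 on 64-bit targets
    std::printf("32-byte struct lock-free: %d\n", big.is_lock_free());   // typically 0: implemented with a lock
}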
But when you use
uint64_t a = x.load(std::memory_order_relaxed);
you tell the compiler to drop most of those guarantees:
Relaxed operation: there are no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed
So you have instructed the compiler not to care about synchronizing with other threads or caches, or about preserving the order in which the instructions are written. All you care about is that a read or write is not broken up into two or more parts where you could get mixed data. The load will get either the whole data from before the store or the whole data from after the store in the other thread, but it's unspecified which of the two values you get. That is what you already get for free for aligned 64-bit loads/stores on arm64, so the generated code is identical.
Note: if you have multiple atomics then accessing one with a stronger memory order will synchronize both of them. So you can see code that will do one load with a strong order together with others with weak order. Same for groups of writes. This can speed up access. But it's hard to get right.
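A sketch of that idea, assuming made-up variable names: a group of relaxed stores published by a single release store on a flag, matched by an acquire load on the reader side:

#include <atomic>

std::atomic<int> a{0}, b{0};
std::atomic<bool> published{false};

void writer() {
    a.store(1, std::memory_order_relaxed);              // cheap stores
    b.store(2, std::memory_order_relaxed);
    published.store(true, std::memory_order_release);   // one release store "covers" the group
}

void reader() {
    if (published.load(std::memory_order_acquire)) {    // one acquire load "covers" the group
        int va = a.load(std::memory_order_relaxed);     // guaranteed to see 1
        int vb = b.load(std::memory_order_relaxed);     // guaranteed to see 2
        (void)va; (void)vb;
    }
}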
Whether or not an access is a data race in the sense of the C++ language standard is independent of the underlying hardware. The language has its own memory model and even if a straight-forward compilation to the target architecture would be free of problems, the compiler may still optimize based on the assumption that the program is free of data races in the sense of the C++ memory model.
Accessing a non-atomic in two threads without synchronization with one of them being a write is always a data race in the C++ model. So yes, the first program has a data race and therefore undefined behavior.
In the second program the object is an atomic, so there cannot be a data race.

Do I need writer/reader synchronization if I don't care if reader will read stale data?

Say I have a reader thread. The reader has a vector of bools; the size of the vector never changes and is always known. The reader reads some data from another source, calculates an index from that data, and checks whether vector[index] == true. If true, the reader sends the data further; if not, it drops the data.
Say I also have a writer thread. The writer sets vector[index] to true or false.
Do I really need a mutex for the vector if I don't mind that some extra data chunks get sent or some chunks get lost? Is it absolutely safe to use a vector in this way?
Reading and writing the same value, however small, from multiple threads without synchronization, is a data race, a form of undefined behavior.
Even if the hardware guarantees cache coherency (as in x86), the C++ memory model is defined such that in the absence of synchronization each thread is assumed to be executing in isolation. Then according to the as-if rule the compiler is allowed to optimize away and reorder memory accesses any way it sees fit, so the behavior of a program with a data race becomes unpredictable. The reader thread may never "see" any updated value, for example. Or the writer may not write anything to memory until the thread is finished, or write in a different order. The behavior may change between compiler versions, optimization levels, etc.
Note that synchronization doesn't mean a mutex, an atomic will do too (a vector of atomics is somewhat complicated, but is possible too, though my feeling is that a userspace mutex would be more efficient).
Bonus note: don't forget about false sharing when accessing the same vector from multiple threads.
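A sketch of a fixed-size vector of atomics, which is one way to apply this here (the size is set once at construction; the elements can't be copied or resized afterwards, and tightly packed flags will exhibit the false sharing mentioned above):

#include <atomic>
#include <cstddef>
#include <vector>

std::vector<std::atomic<bool>> flags(1024);   // fixed size, all initialized to false

void writer_set(std::size_t index, bool value) {
    flags[index].store(value, std::memory_order_release);
}

bool reader_check(std::size_t index) {
    return flags[index].load(std::memory_order_acquire);
}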
As rustyx already indicated, atomics could do the trick.
If you just care about reading the value at some point in the future and not suffering a data race (that is, avoiding the lack of a happens-before relation between the write and the read), then it is sufficient to set the flags using memory_order_release and read the flags using memory_order_acquire.
On x86, which uses the TSO memory model, all regular stores are release stores and all regular loads are acquire loads, so at the hardware level there is no price to pay; only the compiler is prevented from doing certain reorderings.
The expensive write on x86 is memory_order_seq_cst. In that case the store is put in the store buffer and the CPU stops executing any loads until the store buffer has been drained. With memory_order_release, the store is placed in the store buffer and the CPU can continue with the next instruction (even loads), so the CPU is not stalled.
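For illustration, a sketch of how this typically shows up in generated x86-64 code; the comments describe codegen commonly seen from gcc/clang and are not guaranteed by the standard:

#include <atomic>

std::atomic<int> flag{0};

void store_release(int v) {
    flag.store(v, std::memory_order_release);    // typically a plain mov
}

void store_seq_cst(int v) {
    flag.store(v, std::memory_order_seq_cst);    // typically xchg (or mov + mfence): waits for the store buffer
}

int load_acquire() {
    return flag.load(std::memory_order_acquire); // typically a plain mov
}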

With memory_order_relaxed how is total order of modification of an atomic variable assured on typical architectures?

As I understand memory_order_relaxed is to avoid costly memory fences that may be needed with more constrained ordering on a particular architecture.
In that case how is total modification order for an atomic variable achieved on popular processors?
EDIT:
atomic<int> a;
void thread_proc()
{
int b = a.load(memory_order_relaxed);
int c = a.load(memory_order_relaxed);
printf("first value %d, second value %d\n", b, c);
}
int main()
{
thread t1(thread_proc);
thread t2(thread_proc);
a.store(1, memory_order_relaxed);
a.store(2, memory_order_relaxed);
t1.join();
t2.join();
}
What will guarantee that the output won’t be:
first value 1, second value 2
first value 2, second value 1
?
Multi-processors often use the MESI protocol to ensure total store order on a location. Information is transferred at cache-line granularity. The protocol ensures that before a processor modifies the contents of a cache line, all other processors relinquish their copy of the line, and must reload a copy of the modified line. Hence in the example where a processor writes x and then y to the same location, if any processor sees the write of x, it must have reloaded from the modified line, and must relinquish the line again before the writer writes y.
There is usually a specific set of assembly instructions that corresponds to operations on std::atomics, for example an atomic addition on x86 is lock xadd.
By specifying memory order relaxed you can conceptually think of it as telling the compiler "you must use this technique to increment the value, but I impose no other restrictions outside of the standard as-if optimisations rules on top of that". So literally just replacing an add with an lock xadd is likely sufficient under a relaxed ordering constraint.
Also keep in mind that memory_order_relaxed specifies a minimum that the compiler has to respect. Some intrinsics on some platforms come with implicit hardware barriers; being more strongly ordered than requested doesn't violate the constraint.
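A sketch of the typical relaxed-RMW use case this describes: a shared counter where only the final total matters, with no ordering required relative to other data (on x86 the increment typically becomes lock add / lock xadd):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<long> counter{0};

void work() {
    for (int i = 0; i < 100000; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);  // atomic RMW; no ordering w.r.t. other variables
}

int main() {
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::printf("%ld\n", counter.load(std::memory_order_relaxed));  // always 200000: no increments are lost
}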
All atomic operations act in accord with [intro.races]/14:
If an operation A that modifies an atomic object M happens before an operation B that modifies M, then A shall be earlier than B in the modification order of M.
The two stores from the main thread are required to happen in that order, since the two operations are ordered within the same thread. Therefore, they cannot happen outside of that order. If someone sees the value 2 in the atomic, then the storing thread must have executed past the point where the value was set to 1, per [intro.races]/4:
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M.
This of course only applies to atomic operations on a specific atomic object; ordering with respect to other things doesn't exist when using relaxed ordering (which is the point).
How does this get achieved on real machines? In whatever way the compiler sees fit to do so. The compiler could decide that, since you're overwriting the value of the variable you just set, then it can remove the first store per the as-if rule. Nobody ever seeing the value 1 is a perfectly legitimate implementation according to the C++ memory model.
But otherwise, the compiler is required to emit whatever is needed to make it work. Note that out-of-order processors aren't typically allowed to complete dependent operations out of order, so that's typically not a problem.
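A sketch of what that per-object guarantee means for the question's program: within one thread, two successive relaxed loads of the same atomic can never observe the stores in reverse modification order (the assertion is just an illustrative way to state the guarantee):

#include <atomic>
#include <cassert>

std::atomic<int> a{0};

void thread_proc() {
    int b = a.load(std::memory_order_relaxed);
    int c = a.load(std::memory_order_relaxed);
    // Read-read coherence: the second load may not see an earlier value in
    // a's modification order than the first load did, so "2 then 1" is impossible.
    assert(!(b == 2 && c == 1));
}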
There are two parts to inter-thread communication:
a core that can do loads and stores
the memory system which consists of coherent caches
The issue is the speculative execution in the CPU core.
A processor's load and store unit always needs to compare addresses in order to avoid reordering two writes to the same location (if it reorders writes at all), or to avoid pre-fetching a stale value from a location that has just been written to (when reads are done early, before previous writes).
Without that feature, any sequence of executable code would be at risk of having its memory accesses completely randomized, seeing values written by a following instruction, etc. All memory locations would be "renamed" in crazy ways with no way for a program to refer to the same (originally named) location twice in a row.
All programs would break.
On the other hand, memory locations in potentially running code can have two "names":
the location that can hold a modifiable value, in L1d
the location that can be decoded as executable code, in L1i
And these are not connected in any way until a special code-reload instruction is performed; not only the L1i but also the instruction decoder can hold cached copies of locations that are otherwise modifiable.
[Another complication is when two virtual addresses (used by speculative loads or stores) refer to the same physical address (aliasing): that's another conflict that needs to be dealt with.]
Summary: in most cases, a CPU will naturally provide an order for the accesses to each data memory location.
EDIT:
A core needs to keep track of operations that invalidate speculative execution, mainly a write to a location later read by a speculative instruction. Reads don't conflict with each other, and a CPU core might want to keep track of modifications of cached memory after a speculative read (making reads visibly happen in advance); if reads can be executed out of order, it's conceivable that a later read completes before an earlier read. As to why the system would begin a later read first, a possible cause is that its address computation is easier and completes first.
So consider a system that can begin reads out of order, that considers them complete as soon as a value is made available by the cache, that treats them as valid as long as no write by the same core ends up conflicting with either read, and that does not monitor L1d cache invalidations caused by another CPU wanting to modify a nearby memory location (possibly that very location). On such a system this sequence is possible:
decompose the soon-to-be-executed instructions into sequence A, a long list of sequenced operations ending with a result in r1, and sequence B, a shorter sequence ending with a result in r2
run both in parallel, with B producing a result earlier
speculatively try to load (r2), noting that a write to that address may invalidate the speculation (suppose the location is available in L1d)
then another CPU annoys us by stealing the cache line holding the location (r2)
A completes, making the value of r1 available, and we can speculatively do load (r1) (which happens to be the same address as (r2)); this stalls until our cache gets its cache line back
the load done last can return a value different from the first
Neither the speculation on A nor on B invalidated any memory access, as the system doesn't consider either the loss of the cache line or the return of a different value by the last load to be an invalidation of a speculation (which would be easy to implement, as we have all the information locally).
Here the system sees any read as non-conflicting with any local operation that isn't a local write, and the loads are done in an order that depends on the complexity of A and B, not on whichever comes first in program order (the description above doesn't even say that program order was changed, just that it was ignored by speculation: I never said which of the loads was first in the program).
So for a relaxed atomic load, a special instruction would be needed on such system.
The cache system
Of course the cache system doesn't change the order of requests, as it works like a global random-access system with temporary ownership of lines by cores.

relaxed ordering and inter thread visibility

I learnt that with relaxed ordering, a store to an atomic variable should be visible to other threads "within a reasonable amount of time".
That said, I am pretty sure it should happen in a very short time (a few nanoseconds?).
However, I don't want to rely on "within a reasonable amount of time".
So, here is some code :
std::atomic_bool canBegin{false};
void functionThatWillBeLaunchedInThreadA() {
if(canBegin.load(std::memory_order_relaxed))
produceData();
}
void functionThatWillBeLaunchedInThreadB() {
canBegin.store(true, std::memory_order_relaxed);
}
Thread A and B are within a kind of ThreadPool, so there is no creation of thread or whatsoever in this problem.
I don't need to protect any data, so acquire / consume / release ordering on atomic store/load are not needed here (I think?).
We know for sure that the functionThatWillBeLaunchedInThreadA function will be launched AFTER the end of functionThatWillBeLaunchedInThreadB.
However, in such a code, we don't have any guarantee that the store will be visible in the thread A, so the thread A can read a stale value (false).
Here are some solutions I have thought about.
Solution 1 : Use volatility
Just declare volatile std::atomic_bool canBegin{false}; here the volatility would guarantee that we do not see a stale value.
Solution 2 : Use mutex or spinlock
Here the idea is to protect the canBegin access via a mutex / spinlock, which guarantees via release/acquire ordering that we will not see a stale value.
I don't need canGo to be an atomic either.
Solution 3 : not sure at all, but memory fence?
Maybe this code will not work, so, tell me :).
bool canGo{false}; // not an atomic value now
// in thread A
std::atomic_thread_fence(std::memory_order_acquire);
if(canGo) produceData();
// in thread B
canGo = true;
std::atomic_thread_fence(std::memory_order_release);
On cppreference, for this case, it is written that:
all non-atomic and relaxed atomic stores that are sequenced-before FB
in thread B will happen-before all non-atomic and relaxed atomic loads
from the same locations made in thread A after FA
Which solution would you use and why?
There's nothing you can do to make a store visible to other threads any sooner. See If I don't use fences, how long could it take a core to see another core's writes? - barriers don't speed up visibility to other cores, they just make this core wait until that's happened.
The store part of an RMW is not different from a pure store for this, either.
(Certainly on x86; not totally sure about other ISAs, where a relaxed LL/SC might possibly get special treatment from the store buffer, possibly being more likely to commit before other stores if this core can get exclusive ownership of the cache line. But I think it still would have to retire from out-of-order execution so the core knows it's not speculative.)
Anthony's answer that was linked in comment is misleading; as I commented there:
If the RMW runs before the other thread's store commits to cache, it doesn't see the value, just like if it was a pure load. Does that mean "stale"? No, it just means that the store hasn't happened yet.
The only reason RMWs need a guarantee about "latest" value is that they're inherently serializing operations on that memory location. This is what you need if you want 100 unsynchronized fetch_add operations to not step on each other and be equivalent to += 100, but otherwise best-effort / latest-available value is fine, and that's what you get from a normal atomic load.
If you require instant visibility of results (a nanosecond or so), that's only possible within a single thread, like x = y; x += z;
Also note, the C / C++ standard requirement (actually just a note) to make stores visible in a reasonable amount of time is in addition to the requirements on ordering of operations. It doesn't mean seq_cst store visibility can be delayed until after later loads. All seq_cst operations happen in some interleaving of program order across all threads.
On real-world C++ implementations, the visibility time is entirely up to hardware inter-core latency. But the C++ standard is abstract, and could in theory be implemented on a CPU that required manual flushing to make stores visible to other threads. Then it would be up to the compiler to not be lazy and defer that for "too long".
volatile atomic<T> is useless; compilers already don't optimize atomic<T>, so every atomic access done by the abstract machine will already happen in the asm. (Why don't compilers merge redundant std::atomic writes?). That's all that volatile does, so volatile atomic<T> compiles to the same asm as atomic<T> for anything you can do with the atomic.
Defining "stale" is a problem because separate threads running on separate cores can't see each other's actions instantly. It takes tens of nanoseconds on modern hardware to see a store from another thread.
But you can't read "stale" values from cache; that's impossible because real CPUs have coherent caches. (That's why volatile int could be used to roll your own atomics before C++11, but is no longer useful.) You may need an ordering stronger than relaxed to get the semantics you want as far as one value being older than another (i.e. "reordering", not "stale"). But for a single value, if you don't see a store, that means your load executed before the other core took exclusive ownership of the cache line in order to commit its store. i.e. that the store hasn't truly happened yet.
In the formal ISO C++ rules, there are guarantees about what value you're allowed to see which effectively give you the guarantees you'd expect from cache coherency for a single object, like that after a reader sees a store, further loads in this thread won't see some older store and then eventually back to the newest store. (https://eel.is/c++draft/intro.multithread#intro.races-19).
(Note that for 2 writers + 2 readers with non-seq_cst operations, it's possible for the readers to disagree about the order in which the stores happened. This is called IRIW reordering, but most hardware can't do it; only some PowerPC. Will two atomic writes to different locations in different threads always be seen in the same order by other threads? - so it's not always quite as simple as "the store hasn't happened yet"; it can be visible to some threads before others. But it's still true that you can't speed up visibility, only for example slow down the readers so none of them see it via the "early" mechanism, i.e. with hwsync for the PowerPC loads to drain the store buffer first.)
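For reference, a sketch of that IRIW litmus test, with made-up function names; the point is only which outcomes the memory model allows:

#include <atomic>

std::atomic<int> x{0}, y{0};

void writer1() { x.store(1, std::memory_order_release); }
void writer2() { y.store(1, std::memory_order_release); }

// With only acquire/release, reader1 may observe x == 1, y == 0 while reader2
// observes y == 1, x == 0, i.e. the two readers disagree about which store
// happened first. Making every operation seq_cst forbids that outcome.
void reader1(int& r1, int& r2) {
    r1 = x.load(std::memory_order_acquire);
    r2 = y.load(std::memory_order_acquire);
}
void reader2(int& r3, int& r4) {
    r3 = y.load(std::memory_order_acquire);
    r4 = x.load(std::memory_order_acquire);
}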
We know for sure that the functionThatWillBeLaunchedInThreadA function
will be launched AFTER the end of the
functionThatWillBeLaunchedInThreadB.
First of all, if this is the case then it's likely that your task queue mechanism takes care of the necessary synchronization already.
On to the answer...
By far the simplest thing to do is acquire/release ordering. All the solutions you gave are worse.
std::atomic_bool canBegin{false};
void functionThatWillBeLaunchedInThreadA() {
if(canBegin.load(std::memory_order_acquire))
produceData();
}
void functionThatWillBeLaunchedInThreadB() {
canBegin.store(true, std::memory_order_release);
}
By the way, shouldn't this be a while loop?
void functionThatWillBeLaunchedInThreadA() {
while (!canBegin.load(std::memory_order_acquire))
{ }
produceData();
}
I don't need to protect any data, so acquire / consume / release
ordering on atomic store/load are not needed here (I think?)
In this case, the ordering is required to keep the compiler/CPU/memory subsystem from ordering the store of true to canBegin before the previous reads/writes have completed. And it should actually stall the CPU until it can be guaranteed that every write that comes before it in program order will propagate before the store to canBegin. On the load side it prevents memory from being read/written before canBegin is read as true.
However, in such a code, we don't have any guarantee that the store
will be visible in the thread A, so the thread A can read a stale
value (false).
You said yourself:
a store on an atomic variable should be visible to other thread in a
"within a reasonnable amount of time".
Even with relaxed memory order, a write is guaranteed to eventually reach the other cores and all cores will eventually agree on any given variable's store history, so there are no stale values. There are only values that haven't propagated yet. What's "relaxed" about it is the store order in relation to other variables. Thus, memory_order_relaxed solves the stale read problem (but doesn't address the ordering required as discussed above).
Don't use volatile. It doesn't provide all the guarantees required of atomics in the C++ memory model, so using it would be undefined behavior. See https://en.cppreference.com/w/cpp/atomic/memory_order#Relaxed_ordering at the bottom to read about it.
You could use a mutex or spinlock, but a mutex operation is much more expensive than a lock-free std::atomic acquire-load/release-store. A spinlock will do at least one atomic read-modify-write operation...and possibly many. A mutex is definitely overkill. But both have the benefit of simplicity in the C++ source. Most people know how to use locks so it's easier to demonstrate correctness.
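For comparison, a minimal spinlock sketch built on std::atomic_flag; even the uncontended lock() costs an atomic read-modify-write, which is what makes it heavier than a plain acquire-load on the flag:

#include <atomic>

std::atomic_flag locked = ATOMIC_FLAG_INIT;

void lock() {
    while (locked.test_and_set(std::memory_order_acquire)) {  // atomic RMW on every iteration
        // spin; real code would pause or yield here
    }
}

void unlock() {
    locked.clear(std::memory_order_release);
}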
A memory fence will also work but your fences are in the wrong spot (it's counter-intuitive) and the inter-thread communication variable should be std::atomic. (Careful when playing these games...! It's easy to get undefined behavior) Relaxed ordering is ok thanks to the fences.
std::atomic<bool> canGo{false}; // MUST be atomic
// in thread A
if(canGo.load(std::memory_order_relaxed))
{
std::atomic_thread_fence(std::memory_order_acquire);
produceData();
}
// in thread B
std::atomic_thread_fence(std::memory_order_release);
canGo.store(true, std::memory_order_relaxed);
The memory fences are actually stricter than acquire/release ordering on the std::atomic load/store, so this gains nothing and could be more expensive.
It seems like you really want to avoid overhead with your signaling mechanism. This is exactly what the std::atomic acquire/release semantics were invented for! You are worrying too much about stale values. Yes, an atomic RMW will give you the "latest" value, but RMWs are also very expensive operations themselves. I want to give you an idea of how fast acquire/release is. It's most likely that you're targeting x86. x86 has total store order and word-sized loads/stores are atomic, so a load-acquire compiles to just a regular load and a release-store compiles to a regular store. So it turns out that almost everything in this long post will probably compile to exactly the same code anyway.

Is it possible that a store with memory_order_relaxed never reaches other threads?

Suppose I have a thread A that writes to an atomic_int x = 0;, using x.store(1, std::memory_order_relaxed);. Without any other synchronization methods, how long would it take before other threads can see this, using x.load(std::memory_order_relaxed);? Is it possible that the value written to x stays entirely thread-local given the current definition of the C/C++ memory model that the standard gives?
The practical case that I have at hand is where a thread B reads an atomic_bool frequently to check if it has to quit; Another thread, at some point, writes true to this bool and then calls join() on thread B. Clearly I do not mind to call join() before thread B can even see that the atomic_bool was set, nor do I mind when thread B already saw the change and exited execution before I call join(). But I am wondering: using memory_order_relaxed on both sides, is it possible to call join() and block "forever" because the change is never propagated to thread B?
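A minimal sketch of that quit-flag pattern, with hypothetical names, just to make the scenario in the question concrete:

#include <atomic>
#include <thread>

std::atomic<bool> must_quit{false};

void worker() {                           // thread B
    while (!must_quit.load(std::memory_order_relaxed)) {
        // ... do a chunk of work ...
    }
}

int main() {                              // the other thread
    std::thread b(worker);
    // ... some time later ...
    must_quit.store(true, std::memory_order_relaxed);
    b.join();                             // the question: can this block forever?
}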
Edit
I contacted Mark Batty (the brain behind mathematically verifying and subsequently fixing the C++ memory model requirements). It was originally about something else (which turned out to be a known bug in cppmem and his thesis, so fortunately I didn't make a complete fool of myself), and I took the opportunity to ask him about this too; his answer was:
Q: Can it theoretically be that such a store [memory_order_relaxed without (any following) release operation] never reaches the other thread?
Mark: Theoretically, yes, but I don't think that has been observed.
Q: In other words, do relaxed stores make no sense
whatsoever unless you combine them with some release operation (and
acquire on the other thread), assuming you want another thread to
see it?
Mark: Nearly all of the use cases for them do use release and acquire, yes.
This is all the standard has to say on the matter, I believe:
[intro.multithread]/25 An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.
In practice
Without any other synchronization methods, how long would it take
before other threads can see this, using
x.load(std::memory_order_relaxed);?
No time. It's a normal write, it goes to the store buffer, so it will be available in the L1d cache in less time than a blink. But that's only when the assembly instruction is run.
Instructions can be reordered by the compiler, but no reasonable compiler would reorder an atomic operation across an arbitrarily long loop.
In theory
Q: Can it theoretically be that such a store [memory_order_relaxed
without (any following) release operation] never reaches the other
thread?
Mark: Theoretically, yes,
You should have asked him what would happen if the "following release fence" was added back, or with an atomic store-release operation.
Why wouldn't these be reordered and delayed a loooong time? (so long that it seems like an eternity in practice)
Is it possible that the value written to x stays entirely thread-local
given the current definition of the C/C++ memory model that the
standard gives?
If an imaginary and especially perverse implementation wanted to delay the visibility of atomic operation, why would it do that only for relaxed operations? It could well do it for all atomic operations.
Or never run some threads.
Or run some threads so slowly that you would believe they aren't running.
This is what the standard says in 29.3.12:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
There is no guarantee a store will become visible in another thread, there is no guaranteed timing and there is no formal relationship with memory order.
Of course, on each regular architecture a store will become visible, but on rare platforms that do not support cache coherency, it may never become visible to a load.
In that case, you would have to reach for an atomic read-modify-write operation to get the latest value in the modification order.
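A sketch of that last point: an atomic RMW used as a read that is guaranteed to observe the latest value in the modification order (fetch_add of zero is just one conventional way to spell a value-preserving RMW):

#include <atomic>

std::atomic<int> x{0};

int read_latest() {
    // An RMW reads the last value in x's modification order, unlike a plain
    // load, which is only guaranteed not to go backwards in that order.
    return x.fetch_add(0, std::memory_order_acq_rel);
}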