ARM64 64 bit load/store data race - c++

According to this, a 64 bit load/store is considered to be an atomic access on arm64. Given this, is the following program still considered to have a data race (and thus can exhibit UB) when compiled for arm64 (ignore ordering with respect to other memory accesses)
uint64_t x;
// Thread 1
void f()
{
uint64_t a = x;
}
// Thread 2
void g()
{
x = 1;
}
If instead I switch this to using
std::atomic<uint64_t> x{};
// Thread 1
void f()
{
uint64_t a = x.load(std::memory_order_relaxed);
}
// Thread 2
void g()
{
x.store(1, std::memory_order_relaxed);
}
Is the second program considered data race free?
On arm64, it looks like the compiler ends up generating the same instruction for a normal 64 bit load/store and a load/store of an atomic with memory_order_relaxed, so what's the difference?

std::atomic solves 4 problems.
One is that load/store is atomic, meaning you don't get loads and stores intermixed so that for example you load 32bit from before a store and the other 32bit from after a store. Normally everything up to register size is naturally atomic in that sense on the CPU itself. Things might break with unaligned access, potentially only when the access crosses a cacheline. In std::atmoic<T> implementations you will see the use of locks when the size of T exceeds the size the CPU reads/writes atomically on it's own.
The other thing std::atomic does is synchronize access between threads. Just because one thread writes data to a variable doesn't mean another thread sees that data appear instantly. The writing cpu puts the data into it's store buffer hoping it just gets overwritten again or adjacent memory gets written and the 2 writes can be combined. After a while the data goes to L1 cache where it can stay even longer, then L2 and L3. Depending on the architecture cache may or may not be shared between CPU cores. They also might not synchronize automatically. So when you want to access the same memory address from multiple cores you have to tell the CPU to synchronize the access with other cores.
The third thing has to with modern CPUs doing out-of-order execution and speculative execution. That means even if the code checks a variable and then reads a second variable the CPU might read the second variable first. If the first variable acts as a semaphore signaling the second variable is ready to be read then this can fail because the read happens before the data is ready. The std::atomic adds barriers preventing the CPU to do these reorderings so reads and writes happen in a specific order in the hardware.
The fourth thing is much the same but for the compiler. std::atomic prevents the compiler from reordering instructions across it. Or from optimizing multiple reads or writes into just one.
All of this std::atomic does automatiocaly for you if you just use it without specifying any memory order. The default memory order is the strongest order.
But when you use
uint64_t a = x.load(std::memory_order_relaxed);
you tell the compiler to ignore most of the things:
Relaxed operation: there are no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed
So you instructed the compiler not to care about synchronizing with other threads or caches or to preserve the order the instructions are written. All you care about is that reads or writes are not broken up into 2 or more parts where you could get mixed data. The load will get either the whole data from before the store or the whole data from after the store in the other thread. But it's completely undefined which of the two values you get. Which is what you get for all 64bit load/store for free so the code is identical.
Note: if you have multiple atomics then accessing one with a stronger memory order will synchronize both of them. So you can see code that will do one load with a strong order together with others with weak order. Same for groups of writes. This can speed up access. But it's hard to get right.

Whether or not an access is a data race in the sense of the C++ language standard is independent of the underlying hardware. The language has its own memory model and even if a straight-forward compilation to the target architecture would be free of problems, the compiler may still optimize based on the assumption that the program is free of data races in the sense of the C++ memory model.
Accessing a non-atomic in two threads without synchronization with one of them being a write is always a data race in the C++ model. So yes, the first program has a data race and therefore undefined behavior.
In the second program the object is an atomic, so there cannot be a data race.

Related

Do I need writer/reader synchronization if I don't care if reader will read stale data?

Let I have a reader thread. Reader has a vector of bools. Size of the vector isn't changed and always known. Reader reads some data from another source, calculates an index from the data and checks if vector[index] == true. If true, Reader sends data further. If not, drops data.
Let I have a writer thread. Writer makes vector[index] true or false.
Do I really need a mutex for vector if I don't bother that some extra data chunks will be sent or some chunks will be lost? Is it absolutely safe to use a vector in this way?
Reading and writing the same value, however small, from multiple threads without synchronization, is a data race, a form of undefined behavior.
Even if the hardware guarantees cache coherency (as in x86), the C++ memory model is defined such that in the absence of synchronization each thread is assumed to be executing in isolation. Then according to the as-if rule the compiler is allowed to optimize away and reorder memory accesses any way it sees fit, so the behavior of a program with a data race becomes unpredictable. The reader thread may never "see" any updated value, for example. Or the writer may not write anything to memory until the thread is finished, or write in a different order. The behavior may change between compiler versions, optimization levels, etc.
Note that synchronization doesn't mean a mutex, an atomic will do too (a vector of atomics is somewhat complicated, but is possible too, though my feeling is that a userspace mutex would be more efficient).
Bonus note: don't forget about false sharing when accessing the same vector from multiple threads.
As rustyx already indicated, atomics could do the trick.
If you just care about reading the value at some point in the future and not suffer from a data race (so the lack of a happens before relation between the write and the read), then it would be sufficient to set the flags using a memory_order_release and get the flags using a memory_order_acquire.
On the X86; which uses the TSO memory model, all regular stores are release stores and all regular loads are acquire loads. So on a hardware level there is no price to pay. Only the compiler will be prevented from doing certain reorderings.
The expensive write on an X86 is the memory_order_sec_cst. In that case, the store is put on the store buffer and the CPU stops executing any loads till the store buffer has been drained. With a memory_order_sec_cst, the store is placed on the store buffer and the CPU can continue with the next instruction (even loads); so the CPU is not stalled.

memory model, how load acquire semantic actually works?

From very nice Paper and article about memory reordering.
Q1: I understand that cache-coherence, store buffer and invalidation queue is root cause of memory reordering ?
Store release is quite understandable, have to wait for all load and store are completed before set flag to true.
About load acquire, typical use of atomic load is waiting for a flag. Suppose we have 2 threads:
int x = 0;
std::atomic<bool> ready_flag = false;
// thread-1
if(ready_flag.load(std::memory_order_relaxed))
{
// (1)
// load x here
}
// (2)
// load x here
// thread-2
x = 100;
ready_flag.store(true, std::memory_order_release);
EDIT: in thread-1, it should be a while loop, but I copied the logic from article above. So, assume memory-reorder is occurred just in time.
Q2: Because (1) and (2) depends on if condition, CPU have to wait for ready_flag, does it mean write-release is enough ? How memory-reordering can happens with this context ?
Q3: Obviously we have load-acquire, so I guess mem-reorder is possible, then where should we place the fence, (1) or (2) ?
Accessing an atomic variable is not a mutex operation; it merely accesses the stored value atomically, with no chance for any CPU operation to interrupt the access such that no data races can occur with regard to accessing that value (it can also issue barriers with regard to other accesses, which is what the memory orders provide). But that's it; it doesn't wait for any particular value to appear in the atomic variable.
As such, your if statement will read whatever value happens to be there at the time. If you want to guard access to x until the other statement has written to it and signaled the atomic, you must:
Not allow any code to read from x until the atomic flag has returned the value true. Simply testing the value once won't do that; you must loop over repeated accesses until it is true. Any other attempt to read from x results in a data race and is therefore undefined behavior.
Whenever you access the flag, you must do so in a way that tells the system that values written by the thread setting that flag should be visible to subsequent operations that see the set value. That requires a proper memory order, one which must be at least memory_order_acquire.
To be technical, the read from the flag itself doesn't have to do the acquire. You could perform an acquire operation after having read the proper value from the flag. But you need to have an acquire-equivalent operation happen before reading x.
The writing statement must set the flag using a releasing memory order that must be at least as powerful as memory_order_release.
Because (1) and (2) depends on if condition, CPU have to wait for ready_flag
There are 2 showstopper flaws in that reasoning:
Branch prediction + speculative execution is a real thing in real CPUs. Control dependencies behave differently from data dependencies. Speculative execution breaks control dependencies.
In most (but not all) real CPUs, data dependencies do work like C++ memory_order_consume. A typical use-case is loading a pointer and then dereferencing it. That's still not safe in C++'s very weak memory model, but will happen to compile to asm that works for most ISAs other than DEC Alpha. Alpha can (in practice on some hardware) even manage to violate causality and load a stale value when dereferencing a just-loaded pointer, even if the stores were correctly ordered.
Compilers can break control and even data dependencies. C++ source logic doesn't always translate directly to asm. In this case a compiler could emit asm that works like this:
tmp = load(x); // compile time reordering before the relaxed load
if (load(ready_flag)
actually use tmp;
It's data-race UB in C++ to read x while it might still be being written, but for most specific ISAs there's no problem with that. You just have to avoid actually using any load results that might be bogus.
This might not be a useful optimization for most ISAs but nothing rules it out. Hiding load latency on in-order pipelines by doing the load earlier might actually be useful sometimes, (if it wasn't being written by another thread, and the compiler might guess that wasn't happening because there's no acquire load).
By far your best bet is to use ready_flag.load(mo_acquire).
A separate problem is that you have commented out code that reads x after the if(), which will run even if the load didn't see the data ready. As #Nicol explained in an answer, this means data-race UB is possible because you might be reading x while the producer is writing it.
Perhaps you wanted to write a spin-wait loop like while(!ready_flag){ _mm_pause(); }? Generally be careful of wasting huge amounts of CPU time spinning; if it might be a long time, use a library-supported thing like maybe a condition variable that gives you efficient fallback to OS-supported sleep/wakeup (e.g. Linux futex) after spinning for a short time.
If you did want a manual barrier separate from the load, it would be
if (ready_flag.load(mo_relaxed))
atomic_thread_fence(mo_acquire);
int tmp = x; // now this is safe
}
// atomic_thread_fence(mo_acquire); // still wouldn't make it safe to read x
// because this code runs even after ready_flag == false
Using if(ready_flag.load(mo_acquire)) would lead to an unconditional fence before branching on the ready_flag load, when compiling for any ISA where acquire-load wasn't available with a single instruction. (On x86 all loads are acquire, on AArch64 ldar does an acquire load. ARM needs load + dsb ish)
The C++ standard doesn't specify the code generated by any particular construct; only correct combinations of thread communication tools product a guaranteed result.
You don't get guarantees from the CPU in C++ because C++ is not a kind of (macro) assembly, not even a "high level assembly", at least not when not all objects have a volatile type.
Atomic objects are communication tools to exchange data between threads. The correct use, for correct visibility of memory operations, is either a store operation with (at least) release followed by a load with acquire, the same with RMW in between, either the store (resp. the load) replaced by RMW with (at least) a release (resp. acquire), on any variant with a relaxed operation and a separate fence.
In all cases:
the thread "publishing" the "done" flag must use a memory ordering at least release (that is: release, release+acquire or sequential consistency),
and the "subscribing" thread, the one acting on the flag must use at least acquire (that is: acquire, release+acquire or sequential consistency).
In practice with separately compiled code other modes might work, depending on the CPU.

Do we need to use lock for multi-threaded x32 system for just reading or writing into a uint32_t variable

I have a question:
Consider a x32 System,
therefore for a uint32_t variable does the system read and write to it atomically?
Meaning, that the entire operation of read or write can be completed in one instruction cycle.
If this is the case then for a multi-threaded x32 system we wont have to use locks for just reading or writing into a uint32_t variable.
Please confirm my understanding.
It is only atomic if you write the code in assembler and pick the appropriate instruction. When using a higher level language, you don't have any control over which instructions that will get picked.
If you have some C code like a = b; then the machine code generated might be "load b into register x", "store register x in the memory location of a", which is more than one instruction. An interrupt or another thread executed between those two will mean data corruption if it uses the same variable. Suppose the other thread writes a different value to a - then that change will be lost when returning to the original thread.
Therefore you must use some manner of protection mechanism, such as _Atomic qualifier, mutex or critical sections.
Yes, one needs to use locks or some other appropriate mechanism, like the atomics.
C11 5.1.2.4p4:
Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.
C11 5.1.2.4p25:
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
Additionally if you've got a variable that is not volatile-qualified then the C standard does not even require that the changes hit to the memory at all; unless you use some synchronization mechanism the data races could have much longer spans in an optimized program than one would perhaps initially think could be possible - for example the writes can be totally out of order and so forth.
The usage of locks is not (only) to ensure atomicity, 32-bit variables are already been written atomically.
Your problem is to protect simultaneous writing:
int x = 0;
Function 1: x++;
Function 2: x++;
If there is no synchronization, x might up end as 1 instead of 2 because function 2 might be reading x = 0, before function 1 modifies x. The worst thing in all this is that it might happen or not at random (or at your client's PC), so debugging is difficult.
The issue is that variables aren't updated instantly.
Each processor's core has its own private memory (L1 and L2 caches). So if you modify a variable, say x++, in two different threads on two different cores - then each core updates their own version of x.
Atomic operations and mutexes ensure synchronization of these variables with the shared memory (RAM / L3 cache).

relaxed ordering and inter thread visibility

I learnt from relaxed ordering as a signal that a store on an atomic variable should be visible to other thread in a "within a reasonnable amount of time".
That say, I am pretty sure it should happen in a very short time (some nano second ?).
However, I don't want to rely on "within a reasonnable amount of time".
So, here is some code :
std::atomic_bool canBegin{false};
void functionThatWillBeLaunchedInThreadA() {
if(canBegin.load(std::memory_order_relaxed))
produceData();
}
void functionThatWillBeLaunchedInThreadB() {
canBegin.store(true, std::memory_order_relaxed);
}
Thread A and B are within a kind of ThreadPool, so there is no creation of thread or whatsoever in this problem.
I don't need to protect any data, so acquire / consume / release ordering on atomic store/load are not needed here (I think?).
We know for sure that the functionThatWillBeLaunchedInThreadAfunction will be launched AFTER the end of the functionThatWillBeLaunchedInThreadB.
However, in such a code, we don't have any guarantee that the store will be visible in the thread A, so the thread A can read a stale value (false).
Here are some solution I think about.
Solution 1 : Use volatility
Just declare volatile std::atomic_bool canBegin{false}; Here the volatileness guarantee us that we will not see stale value.
Solution 2 : Use mutex or spinlock
Here the idea is to protect the canBegin access via a mutex / spinlock that guarantee via a release/acquire ordering that we will not see a stale value.
I don't need canGo to be an atomic either.
Solution 3 : not sure at all, but memory fence?
Maybe this code will not work, so, tell me :).
bool canGo{false}; // not an atomic value now
// in thread A
std::atomic_thread_fence(std::memory_order_acquire);
if(canGo) produceData();
// in thread B
canGo = true;
std::atomic_thread_fence(std::memory_order_release);
On cpp reference, for this case, it is write that :
all non-atomic and relaxed atomic stores that are sequenced-before FB
in thread B will happen-before all non-atomic and relaxed atomic loads
from the same locations made in thread A after FA
Which solution would you use and why?
There's nothing you can do to make a store visible to other threads any sooner. See If I don't use fences, how long could it take a core to see another core's writes? - barriers don't speed up visibility to other cores, they just make this core wait until that's happened.
The store part of an RMW is not different from a pure store for this, either.
(Certainly on x86; not totally sure about other ISAs, where a relaxed LL/SC might possibly get special treatment from the store buffer, possibly being more likely to commit before other stores if this core can get exclusive ownership of the cache line. But I think it still would have to retire from out-of-order execution so the core knows it's not speculative.)
Anthony's answer that was linked in comment is misleading; as I commented there:
If the RMW runs before the other thread's store commits to cache, it doesn't see the value, just like if it was a pure load. Does that mean "stale"? No, it just means that the store hasn't happened yet.
The only reason RMWs need a guarantee about "latest" value is that they're inherently serializing operations on that memory location. This is what you need if you want 100 unsynchronized fetch_add operations to not step on each other and be equivalent to += 100, but otherwise best-effort / latest-available value is fine, and that's what you get from a normal atomic load.
If you require instant visibility of results (a nanosecond or so), that's only possible within a single thread, like x = y; x += z;
Also note, the C / C++ standard requirement (actually just a note) to make stores visible in a reasonable amount of time is in addition to the requirements on ordering of operations. It doesn't mean seq_cst store visibility can be delayed until after later loads. All seq_cst operations happen in some interleaving of program order across all threads.
On real-world C++ implementations, the visibility time is entirely up to hardware inter-core latency. But the C++ standard is abstract, and could in theory be implemented on a CPU that required manual flushing to make stores visible to other threads. Then it would be up to the compiler to not be lazy and defer that for "too long".
volatile atomic<T> is useless; compilers already don't optimize atomic<T>, so every atomic access done by the abstract machine will already happen in the asm. (Why don't compilers merge redundant std::atomic writes?). That's all that volatile does, so volatile atomic<T> compiles to the same asm as atomic<T> for anything you can with the atomic.
Defining "stale" is a problem because separate threads running on separate cores can't see each other's actions instantly. It takes tens of nanoseconds on modern hardware to see a store from another thread.
But you can't read "stale" values from cache; that's impossible because real CPUs have coherent caches. (That's why volatile int could be used to roll your own atomics before C++11, but is no longer useful.) You may need an ordering stronger than relaxed to get the semantics you want as far as one value being older than another (i.e. "reordering", not "stale"). But for a single value, if you don't see a store, that means your load executed before the other core took exclusive ownership of the cache line in order to commit its store. i.e. that the store hasn't truly happened yet.
In the formal ISO C++ rules, there are guarantees about what value you're allowed to see which effectively give you the guarantees you'd expect from cache coherency for a single object, like that after a reader sees a store, further loads in this thread won't see some older store and then eventually back to the newest store. (https://eel.is/c++draft/intro.multithread#intro.races-19).
(Note for 2 writers + 2 readers with non-seq_cst operations, it's possible for the readers to disagree about the order in which the stores happened. This is called IRIW reordering, but most hardware can't do it; only some PowerPC. Will two atomic writes to different locations in different threads always be seen in the same order by other threads? - so it's not always quite as simple as "the store hasn't happened yet", it be visible to some threads before others. But it's still true that you can't speed up visibility, only for example slow down the readers so none of them see it via the "early" mechanism, i.e. with hwsync for the PowerPC loads to drain the store buffer first.)
We know for sure that the functionThatWillBeLaunchedInThreadAfunction
will be launched AFTER the end of the
functionThatWillBeLaunchedInThreadB.
First of all, if this is the case then it's likely that your task queue mechanism takes care of the necessary synchronization already.
On to the answer...
By far the simplest thing to do is acquire/release ordering. All the solutions you gave are worse.
std::atomic_bool canBegin{false};
void functionThatWillBeLaunchedInThreadA() {
if(canBegin.load(std::memory_order_acquire))
produceData();
}
void functionThatWillBeLaunchedInThreadB() {
canBegin.store(true, std::memory_order_release);
}
By the way, shouldn't this be a while loop?
void functionThatWillBeLaunchedInThreadA() {
while (!canBegin.load(std::memory_order_acquire))
{ }
produceData();
}
I don't need to protect any data, so acquire / consume / release
ordering on atomic store/load are not needed here (I think?)
In this case, the ordering is required to keep the compiler/CPU/memory subsystem from ordering the canBegin store true before the previous reads/writes have completed. And it should actually stall the CPU until it can be guaranteed that every write that comes before in program order will propagate before the store to canBegin. On the load side it prevents memory from being read/written before canBegin is read as true.
However, in such a code, we don't have any guarantee that the store
will be visible in the thread A, so the thread A can read a stale
value (false).
You said yourself:
a store on an atomic variable should be visible to other thread in a
"within a reasonnable amount of time".
Even with relaxed memory order, a write is guaranteed to eventually reach the other cores and all cores will eventually agree on any given variable's store history, so there are no stale values. There are only values that haven't propagated yet. What's "relaxed" about it is the store order in relation to other variables. Thus, memory_order_relaxed solves the stale read problem (but doesn't address the ordering required as discussed above).
Don't use volatile. It doesn't provide all the guarantees required of atomics in the C++ memory model, so using it would be undefined behavior. See https://en.cppreference.com/w/cpp/atomic/memory_order#Relaxed_ordering at the bottom to read about it.
You could use a mutex or spinlock, but a mutex operation is much more expensive than a lock-free std::atomic acquire-load/release-store. A spinlock will do at least one atomic read-modify-write operation...and possibly many. A mutex is definitely overkill. But both have the benefit of simplicity in the C++ source. Most people know how to use locks so it's easier to demonstrate correctness.
A memory fence will also work but your fences are in the wrong spot (it's counter-intuitive) and the inter-thread communication variable should be std::atomic. (Careful when playing these games...! It's easy to get undefined behavior) Relaxed ordering is ok thanks to the fences.
std::atomic<bool> canGo{false}; // MUST be atomic
// in thread A
if(canGo.load(std::memory_order_relaxed))
{
std::atomic_thread_fence(std::memory_order_acquire);
produceData();
}
// in thread B
std::atomic_thread_fence(std::memory_order_release);
canGo.store(true, memory_order_relaxed);
The memory fences are actually more strict than acquire/release ordering on the std::atomicload/store so this gains nothing and could be more expensive.
It seems like you really want to avoid overhead with your signaling mechanism. This is exactly what the std::atomic acquire/release semantics were invented for! You are worrying too much about stale values. Yes, an atomic RMW will give you the "latest" value, but they're also very expensive operations themselves. I want to give you an idea of how fast acquire/release is. It's most likely that you're targeting x86. x86 has total store order and word-sized loads/stores are atomic, so an load acquire compiles to just a regular load and and a release store compiles to a regular store. So it turns out that almost everything in this long post will probably compile to exactly the same code anyway.

How do I make memory stores in one thread "promptly" visible in other threads?

Suppose I wanted to copy the contents of a device register into a variable that would be read by multiple threads. Is there a good general way of doing this? Here are examples of two possible methods of doing this:
#include <atomic>
volatile int * const Device_reg_ptr = reinterpret_cast<int *>(0x666);
// This variable is read by multiple threads.
std::atomic<int> device_reg_copy;
// ...
// Method 1
const_cast<volatile std::atomic<int> &>(device_reg_copy)
.store(*Device_reg_ptr, std::memory_order_relaxed);
// Method 2
device_reg_copy.store(*Device_reg_ptr, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release);
More generally, in the face of possible whole program optimization, how does one correctly control the latency of memory writes in one thread being visible in other threads?
EDIT: In your answer, please consider the following scenario:
The code is running on a CPU in an embedded system.
A single application is running on the CPU.
The application has far fewer threads than the CPU has processor cores.
Each core has a massive number of registers.
The application is small enough that whole program optimization is successfully used when building its executable.
How do we make sure that a store in one thread does not remain invisible to other threads indefinitely?
If you would like to update the value of device_reg_copy in atomic fashion, then device_reg_copy.store(*Device_reg_ptr, std::memory_order_relaxed); suffices.
There is no need to apply volatile to atomic variables, it is unnecessary.
std::memory_order_relaxed store is supposed to incur the least amount of synchronization overhead. On x86 it is just a plain mov instruction.
However, if you would like to update it in such a way, that the effects of any preceding stores become visible to other threads along with the new value of device_reg_copy, then use std::memory_order_release store, i.e. device_reg_copy.store(*Device_reg_ptr, std::memory_order_release);. The readers need to load device_reg_copy as std::memory_order_acquire in this case. Again, on x86 std::memory_order_release store is a plain mov.
Whereas if you use the most expensive std::memory_order_seq_cst store, it does insert the memory barrier for you on x86.
This is why they say that x86 memory model is a bit too strong for C++11: plain mov instruction is std::memory_order_release on stores and std::memory_order_acquire on loads. There is no relaxed store or load on x86.
I cannot recommend enough CPU Cache Flushing Fallacy article.
The C++ standard is rather vague about making atomic stores visible to other threads..
29.3.12
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
That is as detailed as it gets, there is no definition of 'reasonable', and it does not have to be immediately.
Using a stand-alone fence to force a certain memory ordering is not necessary since you can specify those on atomic operations, but the question is,
what is your expectation with regards to using a memory fence..
Fences are designed to enforce ordering on memory operations (between threads), but they do not guarantee visibility in a timely manner.
You can store a value to an atomic variable with the strongest memory ordering (ie. seq_cst), but even when another thread executes load() at a later time than the store(),
you might still get an old value from the cache and yet (surprisingly) it does not violate the happens-before relationship.
Using a stronger fence might make a difference wrt. timing and visibility, but there are no guarantees.
If prompt visibility is important, I would consider using a Read-Modify-Write (RMW) operation to load the value.
These are atomic operations that read and modify atomically (ie. in a single call), and have the additional property that they are guaranteed to operate on the latest value.
But since they have to reach a little further than the local cache, these calls also tend to be more expensive to execute.
As pointed out by Maxim Egorushkin, whether or not you can use weaker memory orderings than the default (seq_cst) depends on whether other memory operations need to be synchronized (made visible) between threads.
That is not clear from your question, but it is generally considered safe to use the default (sequential consistency).
If you are on an unusually weak platform, if performance is problematic, and if you need data synchronization between threads, you could consider using acquire/release semantics:
// thread 1
device_reg_copy.store(*Device_reg_ptr, std::memory_order_release);
// thread 2
device_reg_copy.fetch_add(0, std::memory_order_acquire);
If thread 2 sees the value written by thread 1, it is guaranteed that memory operations prior to the store in thread 1 are visible after the load in thread 2.
Acquire/Release operations form a pair and they synchronize based on a run-time relationship between the store and load. In other words, if thread 2 does not see the value stored by thread 1,
there are no ordering guarantees.
If the atomic variable has no dependencies on any other data, you can use std::memory_order_relaxed; store ordering is always guaranteed for a single atomic variable.
As mentioned by others, there is no need for volatile when it comes to inter-thread communication with std::atomic.