Largest "atomic" Type? - concurrency

When working in a concurrent, parallel programming language with multiple threads working together on multiple cores and/or multiple sockets, what is the largest value in memory considered to be atomic?
What I mean is: a string, being a sequence of bytes, is decidedly not atomic because a write to that location in memory may take some time to update the string. Therefore, a lock of some kind must be acquired when reading and writing to the string so that other threads don't see corrupted, half-finished results. However, a string on the stack is atomic because AFAIK the stack is not a shared area of memory across threads.
Is the largest guaranteed, lockless unit a bit or a byte, or does it depend on the instruction used to write that byte? Is it possible, for instance, for a thread to read an integer while another thread is moving the value bit-by-bit from a register to the stack to shared memory, causing the reader thread to see a half-written value?
I guess I am asking what the largest atomic value is on x86_64 and what the guarantees are.

The widest atomic instruction on x86-64 is lock cmpxchg16b, which reads and writes 16 bytes atomically.
Although it is usually used to atomically update a 16-byte object in memory, it can also be used to atomically read such a value.
To atomically update a value, load rdx:rax with the prior value and rcx:rbx with the new value. The instruction atomically updates the memory location with the new value only if the prior value hasn't changed.
To atomically read a 16-byte value, load rdx:rax and rcx:rbx with the same value. (It doesn't matter what value, but 0 is a good choice.) The instruction atomically reads the current value into rdx:rax.
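For illustration, here is a rough sketch of how this can be reached from C++ with the GCC/Clang __atomic builtins on a 16-byte type; with -mcx16 on x86-64 the compiler (or libatomic) can implement these calls with lock cmpxchg16b. The names cas16 and load16 are just illustrative.

// Sketch, assuming GCC/Clang; __int128 is a compiler extension.
// Build with -mcx16 on x86-64; the calls may be routed through libatomic.
using u128 = unsigned __int128;

// Atomically replace *p with 'desired' only if it still equals 'expected'.
// On failure, 'expected' is updated with the current contents of *p.
bool cas16(u128* p, u128& expected, u128 desired)
{
    return __atomic_compare_exchange_n(p, &expected, desired,
                                       /*weak=*/false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

// Atomic 16-byte read using the trick described above: expected == desired,
// so memory is never changed, but the current value ends up in 'expected'.
u128 load16(u128* p)
{
    u128 expected = 0;
    __atomic_compare_exchange_n(p, &expected, expected,
                                false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}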

Related

ARM64 64 bit load/store data race

According to this, a 64-bit load/store is considered to be an atomic access on arm64. Given this, is the following program still considered to have a data race (and can thus exhibit UB) when compiled for arm64? (Ignore ordering with respect to other memory accesses.)
#include <cstdint>

uint64_t x;

// Thread 1
void f()
{
    uint64_t a = x;
}

// Thread 2
void g()
{
    x = 1;
}
If instead I switch this to using
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> x{};

// Thread 1
void f()
{
    uint64_t a = x.load(std::memory_order_relaxed);
}

// Thread 2
void g()
{
    x.store(1, std::memory_order_relaxed);
}
Is the second program considered data race free?
On arm64, it looks like the compiler ends up generating the same instruction for a normal 64 bit load/store and a load/store of an atomic with memory_order_relaxed, so what's the difference?
std::atomic solves 4 problems.
One is that loads and stores are atomic, meaning you don't get loads and stores intermixed so that, for example, you read 32 bits from before a store and the other 32 bits from after it. Normally everything up to register size is naturally atomic in that sense on the CPU itself. Things might break with unaligned access, potentially only when the access crosses a cache line. In std::atomic<T> implementations you will see the use of locks when the size of T exceeds the size the CPU reads/writes atomically on its own.
The second thing std::atomic does is synchronize access between threads. Just because one thread writes data to a variable doesn't mean another thread sees that data appear instantly. The writing CPU puts the data into its store buffer, hoping it just gets overwritten again, or that adjacent memory gets written and the two writes can be combined. After a while the data goes to L1 cache, where it can stay even longer, then L2 and L3. Depending on the architecture, caches may or may not be shared between CPU cores, and they might not synchronize automatically. So when you want to access the same memory address from multiple cores, you have to tell the CPU to synchronize the access with other cores.
The third thing has to do with modern CPUs doing out-of-order execution and speculative execution. That means even if the code checks a variable and then reads a second variable, the CPU might read the second variable first. If the first variable acts as a semaphore signaling that the second variable is ready to be read, then this can fail because the read happens before the data is ready. std::atomic adds barriers preventing the CPU from doing these reorderings, so reads and writes happen in a specific order in the hardware.
The fourth thing is much the same, but for the compiler: std::atomic prevents the compiler from reordering instructions across it, and from optimizing multiple reads or writes into just one.
All of this std::atomic does automatically for you if you just use it without specifying any memory order. The default memory order (std::memory_order_seq_cst) is the strongest one.
But when you use
uint64_t a = x.load(std::memory_order_relaxed);
you tell the compiler to skip most of those guarantees:
Relaxed operation: there are no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed
So you instructed the compiler not to care about synchronizing with other threads or caches, or about preserving the order in which the instructions are written. All you care about is that reads and writes are not broken up into two or more parts where you could get mixed data. The load will get either the whole data from before the store or the whole data from after the store in the other thread, but it's unspecified which of the two values you get. That is exactly what you already get for free for any aligned 64-bit load/store on arm64, so the generated code is identical.
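To make the "whole old value or whole new value" point concrete, here is a small, hedged test program (values and iteration counts are arbitrary): a torn read would show up as a mix of the two halves, which the assert would catch.

#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

std::atomic<uint64_t> x{0};

// The two stored values differ in every bit of each half, so a torn read
// (half old, half new) could not equal any of the legal values.
void writer()
{
    for (int i = 0; i < 1000000; ++i)
        x.store(i % 2 ? 0xFFFFFFFF00000000ull : 0x00000000FFFFFFFFull,
                std::memory_order_relaxed);
}

void reader()
{
    for (int i = 0; i < 1000000; ++i) {
        uint64_t v = x.load(std::memory_order_relaxed);
        assert(v == 0 || v == 0xFFFFFFFF00000000ull
                      || v == 0x00000000FFFFFFFFull);
    }
}

int main()
{
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}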
Note: if you have multiple atomics, then accessing one of them with a stronger memory order also orders the accesses to the others around it. So you can see code that does one load with a strong order together with other loads using a weak order, and the same for groups of writes. This can speed up access, but it's hard to get right.
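A hedged sketch of that pattern (the names are illustrative): the relaxed stores are ordered before the release store of the flag, and the acquire load of the flag orders the relaxed loads after it, so one strong pair covers the whole group.

#include <atomic>

std::atomic<int>  part1{0}, part2{0};
std::atomic<bool> ready{false};

void producer()
{
    part1.store(1, std::memory_order_relaxed);
    part2.store(2, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);   // publishes both parts
}

// Returns false until the producer has published; once it returns true,
// the two relaxed loads are guaranteed to see the published values.
bool consumer(int& a, int& b)
{
    if (!ready.load(std::memory_order_acquire))
        return false;
    a = part1.load(std::memory_order_relaxed);
    b = part2.load(std::memory_order_relaxed);
    return true;
}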
Whether or not an access is a data race in the sense of the C++ language standard is independent of the underlying hardware. The language has its own memory model and even if a straight-forward compilation to the target architecture would be free of problems, the compiler may still optimize based on the assumption that the program is free of data races in the sense of the C++ memory model.
Accessing a non-atomic in two threads without synchronization with one of them being a write is always a data race in the C++ model. So yes, the first program has a data race and therefore undefined behavior.
In the second program the object is an atomic, so there cannot be a data race.

Does std::atomic<> guarantee that a store() operation is propagated (almost) immediately to other threads doing a load()?

I have std::atomic<T> atomic_value; (for type T being bool, int32_t, int64_t or any other). If the 1st thread does
atomic_value.store(value, std::memory_order_relaxed);
and in the 2nd thread at some points in the code I do
auto value = atomic_value.load(std::memory_order_relaxed);
How fast is this updated atomic value propagated from the 1st thread to the 2nd, between CPU cores? (For all CPU models.)
Is it propagated almost immediately? For example, at the speed of cache-coherence propagation on Intel, meaning 0-2 cycles or so, and maybe a few more cycles for other CPU models/manufacturers.
Or may this value sometimes stay un-updated for many, many cycles?
Does atomic guarantee that the value is propagated between CPU cores as fast as possible for the given CPU?
Maybe if instead on the 1st thread I do
atomic_value.store(value, std::memory_order_release);
and on the 2nd thread
auto value = atomic_value.load(std::memory_order_acquire);
then will it help propagate the value faster? (Notice the change of both memory orders.) And is there now a guarantee of speed, or is it the same guarantee as for relaxed order?
As a side question: does replacing the relaxed order with release+acquire also synchronize all modifications of other (non-atomic) variables?
Meaning: is everything that the 1st thread wrote to memory before the store-with-release guaranteed to be visible in the 2nd thread, in exactly its final state (the same as in the 1st thread), at the point of the load-with-acquire, provided of course that the loaded value is the new (updated) one?
So does this mean that for ANY type of std::atomic<> (or std::atomic_flag), a store-with-release in one thread synchronizes all memory writes before it with the point in another thread that does a load-with-acquire of the same atomic, provided of course that the other thread actually sees the updated value? (And if the value seen in the 2nd thread is not yet the new one, then we expect that the memory writes may not yet be visible.)
PS. Why the question arose: going by the name "atomic", it is tempting to conclude (probably mis-conclude) that by default (without extra constraints, i.e. with just relaxed memory order) std::atomic<> only makes each operation on it atomic, and nothing else, with no other guarantees about synchronization or speed of propagation. Meaning that a write to the memory location will be whole (e.g. all 4 bytes at once for int32_t), an exchange with the atomic location will do both the read and the write atomically (actually in a locked fashion), and incrementing a value will atomically do the three operations read-add-write.
The C++ standard says only this [C++20 intro.progress p18]:
An implementation should ensure that the last value (in modification order) assigned by an atomic or
synchronization operation will become visible to all other threads in a finite period of time.
Technically this is only a "should", and "finite time" is not very specific. But the C++ standard is broad enough that you can't expect them to specify a particular number of cycles or nanoseconds or what have you.
In practice, you can expect that a call to any atomic store function, even with memory_order_relaxed, will cause an actual machine store instruction to be executed. The value will not just be left in a register. After that, it's out of the compiler's hands and up to the CPU.
(Technically, if you had two or more stores in succession to the same object, with a bounded amount of other work done in between, the compiler would be allowed to optimize out all but the last one, on the basis that you couldn't have been sure anyway that any given load would happen at the right instant to see one of the other values. In practice I don't believe that any compilers currently do so.)
A reasonable expectation for typical CPU architectures is that the store will become globally visible "without unnecessary delay". The store may go into the core's local store buffer. The core will process store buffer entries as quickly as it can; it does not just let them sit there to age like a fine wine. But they still could take a while. For instance, if the cache line is currently held exclusive by another core, you will have to wait until it is released before your store can be committed.
Using stronger memory ordering will not speed up the process; the machine is already making its best efforts to commit the store. Indeed, a stronger memory ordering may actually slow it down; if the store was made with release ordering, then it must wait for all earlier stores in the buffer to commit before it can itself be committed. On strongly-ordered architectures like x86, every store is automatically release, so the store buffer always remains in strict order; but on a weakly ordered machine, using relaxed ordering may allow your store to "jump the queue" and reach L1 cache sooner than would otherwise have been possible.
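As for the side question about release+acquire and prior non-atomic writes: that is the standard release/acquire publication pattern. A minimal, hedged sketch (variable names are illustrative):

#include <atomic>

int payload = 0;                       // plain, non-atomic data
std::atomic<bool> published{false};

void writer()
{
    payload = 42;                                       // ordinary write
    published.store(true, std::memory_order_release);   // publish it
}

int reader()
{
    while (!published.load(std::memory_order_acquire))
        ;   // spin until the flag is seen
    // The acquire load that returned true synchronizes with the release
    // store, so the earlier plain write to 'payload' is visible here.
    return payload;   // guaranteed to be 42
}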

With memory_order_relaxed how is total order of modification of an atomic variable assured on typical architectures?

As I understand it, memory_order_relaxed is meant to avoid costly memory fences that may be needed with more constrained ordering on a particular architecture.
In that case how is total modification order for an atomic variable achieved on popular processors?
EDIT:
#include <atomic>
#include <cstdio>
#include <thread>

using namespace std;

atomic<int> a;

void thread_proc()
{
    int b = a.load(memory_order_relaxed);
    int c = a.load(memory_order_relaxed);
    printf("first value %d, second value %d\n", b, c);
}

int main()
{
    thread t1(thread_proc);
    thread t2(thread_proc);
    a.store(1, memory_order_relaxed);
    a.store(2, memory_order_relaxed);
    t1.join();
    t2.join();
}
What will guarantee that the output won’t be:
first value 1, second value 2
first value 2, second value 1
?
Multi-processors often use the MESI protocol to ensure total store order on a location. Information is transferred at cache-line granularity. The protocol ensures that before a processor modifies the contents of a cache line, all other processors relinquish their copy of the line, and must reload a copy of the modified line. Hence in the example where a processor writes x and then y to the same location, if any processor sees the write of x, it must have reloaded from the modified line, and must relinquish the line again before the writer writes y.
There is usually a specific set of assembly instructions that corresponds to operations on std::atomics, for example an atomic addition on x86 is lock xadd.
By specifying memory order relaxed you can conceptually think of it as telling the compiler "you must use this technique to increment the value, but I impose no other restrictions on top of that beyond the standard as-if optimisation rules". So literally just replacing an add with a lock xadd is likely sufficient under a relaxed ordering constraint.
Also keep in mind that memory_order_relaxed specifies a minimum that the compiler has to respect. Some intrinsics on some platforms will have implicit hardware barriers; being stronger than requested doesn't violate the constraint.
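A typical use of exactly this is a statistics counter that many threads bump with a relaxed fetch_add; a hedged sketch (the counts and thread numbers are arbitrary):

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<long> hits{0};

// Each fetch_add is a single atomic read-modify-write (e.g. lock xadd on
// x86), so no increment is ever lost even with relaxed ordering.
void worker()
{
    for (int i = 0; i < 100000; ++i)
        hits.fetch_add(1, std::memory_order_relaxed);
}

int main()
{
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i)
        pool.emplace_back(worker);
    for (auto& t : pool)
        t.join();
    std::printf("%ld\n", hits.load());   // always 400000
}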
All atomic operations act in accord with [intro.races]/14:
If an operation A that modifies an atomic object M happens before an operation B that modifies M, then A shall be earlier than B in the modification order of M.
The two stores from the main thread are required to happen in that order, since the two operations are ordered within the same thread. Therefore, they cannot happen outside of that order. If someone sees the value 2 in the atomic, then the first thread must have executed past the point where the value was set to 1, per [intro.races]/4:
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M.
This of course only applies to atomic operations on a specific atomic object; ordering with respect to other things doesn't exist when using relaxed ordering (which is the point).
How does this get achieved on real machines? In whatever way the compiler sees fit to do so. The compiler could decide that, since you're overwriting the value of the variable you just set, then it can remove the first store per the as-if rule. Nobody ever seeing the value 1 is a perfectly legitimate implementation according to the C++ memory model.
But otherwise, the compiler is required to emit whatever is needed to make it work. Note that out-of-order processors aren't typically allowed to complete dependent operations out of order, so that's typically not a problem.
There are two parts to inter-thread communication:
a core that can do loads and stores
the memory system, which consists of coherent caches
The issue is speculative execution in the CPU core.
A processor's load and store units always need to compare addresses in order to avoid reordering two writes to the same location (if they reorder writes at all), or to avoid prefetching a stale value from a location that has just been written to (when reads are done early, before previous writes).
Without that feature, any sequence of executable code would be at risk of having its memory accesses completely randomized, seeing values written by a following instruction, etc. All memory locations would be "renamed" in crazy ways with no way for a program to refer to the same (originally named) location twice in a row.
All programs would break.
On the other hand, memory locations in potentially running code can have two "names":
the location that can hold a modifiable value, in L1d
the location that can be decoded as executable code, in L1i
And these are not connected in any way until a special "reload code" operation is performed; not only the L1i but also the instruction decoder can have cached copies of locations that are otherwise modifiable.
[Another complication is when two virtual addresses (used by speculative loads or stores) refer to the same physical address (aliasing): that's another conflict that needs to be dealt with.]
Summary: in most cases, a CPU will naturally provide an order for accesses to each data memory location.
EDIT:
A core needs to keep track of operations that invalidate speculative execution, mainly a write to a location later read by a speculative instruction. Reads don't conflict with each other, and a CPU core might want to keep track of modifications of cached memory after a speculative read (making reads happen visibly in advance); if reads can be executed out of order, it's conceivable that a later read completes before an earlier read. As for why the system would begin a later read first, a possible cause is that its address computation is simpler and completes first.
So on a system that can begin reads out of order, considers them completed as soon as a value is made available by the cache, treats them as valid as long as no write by the same core ends up conflicting with either read, and does not monitor L1d cache invalidations caused by another CPU wanting to modify a nearby memory location (possibly that very location), the following sequence is possible:
decompose the soon-to-be-executed instructions into a sequence A, which is a long list of sequenced operations ending with a result in r1, and B, a shorter sequence ending with a result in r2
run both in parallel, with B producing a result earlier
speculatively try to load (r2), noting that a write to that address may invalidate the speculation (suppose the location is available in L1d)
then another CPU annoys us by stealing the cache line holding the location (r2)
A completes, making the value of r1 available, and we can speculatively load (r1) (which happens to be the same address as (r2)); this stalls until our cache gets its cache line back
the value of that last load can be different from the value of the first one
Neither the speculation of A nor that of B invalidated any memory location, as the system doesn't consider either the loss of the cache line or the return of a different value by the last load to be an invalidation of a speculation (which would be easy to implement, as we have all the information locally).
Here the system sees any read as non-conflicting with any local operation that isn't a local write, and the loads are done in an order that depends on the complexity of A and B, not on whichever comes first in program order (the description above doesn't even say that the program order was changed, just that it was ignored by speculation: I never said which of the loads came first in the program).
So for a relaxed atomic load, a special instruction would be needed on such a system.
The cache system
Of course the cache system doesn't reorder requests, as it works like a global random-access system with temporary ownership by cores.

Read a non-atomic variable, atomically?

I have a non-atomic 62-bit double which is incremented in one thread regularly. This access does not need to be atomic. However, this variable is occasionally read (not written) by another thread. If I align the variable on a 64-bit boundary the read is atomic.
However, is there any way I can ensure I do not read the variable mid-way during the increment? Could I call a CPU instruction which serialises the pipeline or something? Memory barrier?
I thought of declaring the variable atomic and using std::memory_order::memory_order_relaxed in my critical thread (and a stricter memory barrier in the rare thread), but it seems to be just as expensive.
Since you tagged x86, this will be x86-specific.
An increment is essentially three parts: read, add, write. The increment as a whole is not atomic, but each of its three steps is (the add doesn't count, I suppose; it's not observable anyway), as long as the variable does not cross a cache-line boundary (this condition is weaker than having to be aligned to its natural alignment; it has been like this since the P6, before which quadwords had to be aligned).
So you already can't read a torn value. The worst you could do is overwrite the variable in between the moments where it is read and the new value is written, but you're only reading it.
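If C++20 is available, one way to express this intent without changing the variable's type is std::atomic_ref: both threads wrap the same plain object, which removes the data race in the C++ sense, while a relaxed aligned 64-bit access still compiles down to an ordinary mov on x86-64. A hedged sketch, using uint64_t as a stand-in for the actual variable:

#include <atomic>
#include <cstdint>

// Plain variable, aligned as std::atomic_ref requires.
alignas(std::atomic_ref<std::uint64_t>::required_alignment)
std::uint64_t counter = 0;

// Hot thread: only this thread writes, so a load + store (not an atomic
// RMW) is enough, and relaxed ordering keeps it as cheap as a plain access.
void hot_thread_increment()
{
    std::atomic_ref<std::uint64_t> c(counter);
    c.store(c.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);
}

// Rare thread: reads the whole value, never a torn one.
std::uint64_t rare_thread_read()
{
    return std::atomic_ref<std::uint64_t>(counter).load(std::memory_order_relaxed);
}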

Atomic 128-bit Memory Mode Selection

Using gcc, my code has an atomic 128-bit integer that is being written to by one thread and read concurrently by 31 threads. I don't care about this variable's synchronization with any other memory in my program (i.e. I'm okay with the compiler reordering two writes to two different integers), as long as reads and writes of this variable itself are consistent. I just want the guarantee that writes to the atomic 128-bit value are "eventually" reflected in the 31 threads reading from it.
Is it safe to use a relaxed memory model? What are the gotchas I should look out for?
Relaxed ordering does not guarantee that the value written by the writer thread will ever become visible to any of the reader threads (the standard only says that it should become visible in a finite period of time).
It is valid behavior that the readers only ever see the initial value of the variable and none of the changes. However, it is guaranteed that a writer thread always sees at least the changes it itself made to the variable (and possibly, but again not guaranteed, any later change applied by another thread).
In other words: within a single thread everything still appears sequentially consistent, but there are no consistency guarantees whatsoever between different threads.
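As a rough sketch of what that usage can look like with gcc (assuming __int128 support; the 16-byte specialization may be routed through libatomic, so link with -latomic and consider -mcx16 on x86-64):

#include <atomic>
#include <cstdio>

std::atomic<unsigned __int128> value{0};

// Writer thread: only the atomicity of the store is requested,
// no ordering with respect to other memory operations.
void writer(unsigned __int128 v)
{
    value.store(v, std::memory_order_relaxed);
}

// Reader threads: each load returns a whole 16-byte value,
// either an old one or the new one, never a torn mix.
unsigned __int128 reader()
{
    return value.load(std::memory_order_relaxed);
}

int main()
{
    // Note: this may report false even when libatomic uses cmpxchg16b
    // internally, so it is not a reliable way to detect that instruction.
    std::printf("lock free: %d\n", static_cast<int>(value.is_lock_free()));
}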