Read a non-atomic variable, atomically? - c++

I have a non-atomic 64-bit double which is incremented regularly in one thread. This access does not need to be atomic. However, this variable is occasionally read (not written) by another thread. If I align the variable on a 64-bit boundary, the read is atomic.
However, is there any way I can ensure I do not read the variable mid-way during the increment? Could I call a CPU instruction which serialises the pipeline or something? Memory barrier?
I thought of declaring the variable atomic and using std::memory_order_relaxed in my critical thread (and a stricter memory order in the rare reader thread), but it seems to be just as expensive.

Since you tagged x86, this will be x86-specific.
An increment is essentially three parts: read, add, write. The increment as a whole is not atomic, but each of the steps is (the add doesn't really count, since it isn't observable anyway), as long as the variable does not cross a cache-line boundary. That condition is weaker than requiring natural alignment, and it has held since the P6; before that, quadwords had to be aligned.
So you already can't read a torn value. The worst you could do is overwrite the variable in between the moment it is read and the moment the new value is written back, but you're only reading it.
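If you want the compiler's guarantee rather than relying on x86 alignment behaviour, a minimal sketch along those lines (the names counter, increment and sample are invented for illustration) is to make the variable a relaxed atomic; on x86 a relaxed load or store compiles to an ordinary load/store instruction with no lock prefix, so the hot thread pays essentially nothing:

#include <atomic>

std::atomic<double> counter{0.0};   // 8 bytes, naturally aligned

// Hot thread: single writer, so the read-modify-write as a whole
// doesn't need to be atomic; only the load and store are (relaxed) atomics.
void increment(double delta)
{
    double v = counter.load(std::memory_order_relaxed);   // ordinary load on x86
    counter.store(v + delta, std::memory_order_relaxed);  // ordinary store on x86
}

// Rare thread: never sees a torn value.
double sample()
{
    return counter.load(std::memory_order_relaxed);
}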

Related

Does std::atomic<> guarantee that a store() is propagated (almost) immediately to other threads' load()s?

I have std::atomic<T> atomic_value; (for T being bool, int32_t, int64_t or any other type). If the 1st thread does
atomic_value.store(value, std::memory_order_relaxed);
and at some points in the code the 2nd thread does
auto value = atomic_value.load(std::memory_order_relaxed);
how fast is the updated value propagated from the 1st thread to the 2nd, between CPU cores? (For all CPU models.)
Is it propagated almost immediately? For example, at the speed of cache-coherence propagation on Intel, meaning 0-2 cycles or so, and maybe a few more cycles for other CPU models/manufacturers?
Or can the value sometimes stay un-updated for many, many cycles?
Does atomic guarantee that the value is propagated between CPU cores as fast as possible for the given CPU?
What if instead, in the 1st thread, I do
atomic_value.store(value, std::memory_order_release);
and in the 2nd thread
auto value = atomic_value.load(std::memory_order_acquire);
will it help propagate the value faster? (Notice the change of both memory orders.) Is there now a guarantee of speed, or is it the same guarantee as for relaxed order?
As a side question: does replacing the relaxed order with release+acquire also synchronize all modifications of other (non-atomic) variables?
That is, is everything the 1st thread wrote to memory before the store-with-release guaranteed to be visible to the 2nd thread, in its final state (the same as in the 1st thread), at the point of the load-with-acquire, provided of course that the loaded value was the new (updated) one?
So for ANY type of std::atomic<> (or std::atomic_flag), does a store-with-release in one thread synchronize all memory writes before it with the point in another thread that does a load-with-acquire of the same atomic, again provided that the other thread actually observed the updated value? (If the value seen in the 2nd thread is not yet the new one, then of course we don't expect the earlier memory writes to be visible yet.)
PS. Why the question arose: from the name "atomic" it is natural to conclude (probably mis-conclude) that by default (without extra constraints, i.e. with just relaxed memory order) std::atomic<> only makes each individual operation atomic, and nothing else, with no other guarantees about synchronization or speed of propagation. That is, a write to the memory location is whole (e.g. all 4 bytes at once for int32_t), an exchange does the read and the write atomically (in a locked fashion), and an increment does the read-add-write sequence atomically.
The C++ standard says only this [C++20 intro.progress p18]:
An implementation should ensure that the last value (in modification order) assigned by an atomic or
synchronization operation will become visible to all other threads in a finite period of time.
Technically this is only a "should", and "finite time" is not very specific. But the C++ standard is broad enough that you can't expect them to specify a particular number of cycles or nanoseconds or what have you.
In practice, you can expect that a call to any atomic store function, even with memory_order_relaxed, will cause an actual machine store instruction to be executed. The value will not just be left in a register. After that, it's out of the compiler's hands and up to the CPU.
(Technically, if you had two or more stores in succession to the same object, with a bounded amount of other work done in between, the compiler would be allowed to optimize out all but the last one, on the basis that you couldn't have been sure anyway that any given load would happen at the right instant to see one of the other values. In practice I don't believe that any compilers currently do so.)
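To make that hypothetical concrete, here is the kind of code the parenthetical above is talking about (names invented for illustration). Current compilers emit both stores, but a compiler would be allowed to drop the first one, since no load is guaranteed to run at the right instant to observe the intermediate value:

#include <atomic>

std::atomic<int> flag{0};

void signal_twice()
{
    // Legally removable: no reader is guaranteed to catch the 1.
    flag.store(1, std::memory_order_relaxed);
    flag.store(2, std::memory_order_relaxed);
}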
A reasonable expectation for typical CPU architectures is that the store will become globally visible "without unnecessary delay". The store may go into the core's local store buffer. The core will process store buffer entries as quickly as it can; it does not just let them sit there to age like a fine wine. But they still could take a while. For instance, if the cache line is currently held exclusive by another core, you will have to wait until it is released before your store can be committed.
Using stronger memory ordering will not speed up the process; the machine is already making its best efforts to commit the store. Indeed, a stronger memory ordering may actually slow it down; if the store was made with release ordering, then it must wait for all earlier stores in the buffer to commit before it can itself be committed. On strongly-ordered architectures like x86, every store is automatically release, so the store buffer always remains in strict order; but on a weakly ordered machine, using relaxed ordering may allow your store to "jump the queue" and reach L1 cache sooner than would otherwise have been possible.

std::atomic - behaviour of relaxed ordering

Can the following call to print result in outputting stale/unintended values?
std::mutex g;
std::atomic<int> seq;
int g_s = 0;
int i = 0, j = 0, k = 0; // ignore the fact that these could easily be made atomic
// Thread 1
void do_work() // seldom called
{
// avoid over
std::lock_guard<std::mutex> lock{g};
i++;
j++;
k++;
seq.fetch_add(1, std::memory_order_relaxed);
}
// Thread 2
void consume_work() // spinning
{
const auto s = g_s;
// avoid overhead of constantly acquiring lock
g_s = seq.load(std::memory_order_relaxed);
if (s != g_s)
{
// no lock guard
print(i, j, k);
}
}
TL;DR: this is super broken; use a Seq Lock instead. Or RCU if your data structure is bigger.
Yes, you have data-race UB, and in practice stale values are likely; so are inconsistent values (from different increments). ISO C++ has nothing to say about what will happen, so it depends on how it happens to compile for some real machine, and interrupts / context switches in the reader that happen in the middle of reading some of these multiple vars. e.g. if the reader sleeps for any reason between reading i and j, you could miss many updates, or at least get a j that doesn't match your i.
Relaxed seq with writer+reader using lock_guard
I'm assuming the writer would look the same, so the atomic RMW increment is inside the critical section.
I'm picturing the reader checking seq like it is now, and only taking a lock after that, inside the block that runs print.
Even if you did use lock_guard to make sure the reader got a consistent snapshot of all three variables (something you couldn't get from making each of them separately atomic), I'm not sure relaxed would be sufficient in theory. It might be in practice on most real implementations for real machines (where compilers have to assume there might be a reader that synchronizes a certain way, even if there isn't in practice). I'd use at least release/acquire for seq, if I was going to take a lock in the reader.
Taking a mutex is an acquire operation, same as a std::memory_order_acquire load on the mutex object. A relaxed increment inside a critical section can't become visible to other threads until after the writer has taken the lock.
But in the reader, with if( xyz != seq.load(relaxed) ) { take_lock; ... }, the load is not guaranteed to "happen before" taking the lock. In practice on many ISAs it will, especially x86 where all atomic RMWs are full memory barriers. But in ISO C++, and maybe some real implementations, it's possible for the relaxed load to reorder into the reader's critical section. Of course, ISO C++ doesn't define things in terms of "reordering", only in terms of syncing with and values loads are allowed to see.
(This reordering may not be fully plausible; it would mean the read side would have to actually take the lock based on branch prediction / speculation on the load result. Maybe with lock elision like x86 did with transactional memory, except without x86's strong memory ordering?)
Anyway, it's pretty hairy to reason about, and release / acquire ops are quite cheap on most CPUs. If you expected that to be expensive, and for the check to often be false, you could check again with an acquire load, or put an acquire fence inside the if so it doesn't happen on the no-new-work path.
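A sketch of that last suggestion, reusing the globals from the question: keep the cheap relaxed check on the fast path and only pay for the acquire fence (and the lock) when there is new work.

// Thread 2
void consume_work() // spinning
{
    const auto s = g_s;
    g_s = seq.load(std::memory_order_relaxed);
    if (s != g_s)
    {
        // Upgrade the relaxed load: loads/stores after this fence
        // can't reorder before the seq load that fed the branch.
        std::atomic_thread_fence(std::memory_order_acquire);
        std::lock_guard<std::mutex> lock{g};
        print(i, j, k);
    }
}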
Use a Seq Lock
Your problem is better solved by using your sequence counter as part of a Seq Lock, so neither reader nor writer needs a mutex. (Summary: increment before writing, touch the payload, then increment again. In the reader, read i, j, and k into local temporaries, then check the sequence number again to make sure it's the same, and an even number, with appropriate memory barriers.
See the wikipedia article and/or the link below for the actual details, but the real change from what you have now is that the sequence number has to increment by 2. If you can't handle that, use a separate counter for the actual lock, with seq as part of the payload.)
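A minimal single-writer sketch of that scheme, with invented names and a struct standing in for i, j, k (see the links for why the fences sit where they do; the data race on the payload is the "works in practice" part discussed below):

#include <atomic>

struct Payload { int i, j, k; };

std::atomic<unsigned> seq{0};   // even = stable, odd = write in progress
Payload data{};                 // non-atomic payload, single writer

void publish(const Payload& p)  // writer
{
    unsigned s = seq.load(std::memory_order_relaxed);
    seq.store(s + 1, std::memory_order_relaxed);          // now odd
    std::atomic_thread_fence(std::memory_order_release);  // payload stores stay after the odd store
    data = p;
    seq.store(s + 2, std::memory_order_release);          // even again; publishes the payload
}

bool try_read(Payload& out)     // reader: retry if this returns false
{
    unsigned s1 = seq.load(std::memory_order_acquire);
    if (s1 & 1)
        return false;                                     // writer is mid-update
    out = data;                                           // may tear; discarded below if it did
    std::atomic_thread_fence(std::memory_order_acquire);  // payload loads stay before the re-check
    return s1 == seq.load(std::memory_order_relaxed);
}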
If you don't want to use a mutex in the reader, using one in the writer only helps in terms of implementation-detail side-effects, like making sure stores to memory actually happen, not keeping i in a register across calls if do_work inlines into some caller.
BTW, updating seq doesn't need to be an atomic RMW if there's only one writer. You can do a relaxed load and separately store an incremented temporary (with release semantics).
A Seq Lock is good for cheap reads and occasional writes that make the reader retry. Implementing 64 bit atomic counter with 32 bit atomics shows appropriate fencing.
It relies on non-atomic reads that may have a data race, but not using the result if your sequence counter detects tearing. C++ doesn't define the behaviour in that case, but it works in practice on real implementations. (C++ is mostly keeping its options open in case of hardware race detection, which normal CPUs don't do.)
If you have multiple writers, you'd still use a normal lock to give mutual exclusion between them, or use the sequence counter as a spinlock, which a writer acquires by making the count odd. Otherwise you just need the sequence counter.
Your global g_s is just to track the latest sequence number the reader has seen? Storing it next to the data defeats some of the purpose/benefit, since it means the reader is writing the same cache line as the writer (assuming that variables declared near each other end up near each other in memory). Consider making it static inside the function, or separating it from the shared data with padding such as alignas(64) or 128. (That wouldn't guarantee that a compiler doesn't put it right before the other vars, though; a struct would let you control the layout of all of them. With enough alignment, you can make sure they're not in the same aligned pair of cache lines.)
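For example, a hypothetical struct that keeps the reader's bookkeeping off the writer's cache line might look like this (field names taken from the question, padding values as suggested above):

#include <atomic>

struct Shared {
    // Written by thread 1, read by thread 2.
    alignas(64) std::atomic<int> seq{0};
    int i = 0, j = 0, k = 0;

    // Reader-private: last sequence number thread 2 has seen.
    // On its own cache line so the reader's stores to it don't
    // contend with the writer's line above.
    alignas(64) int g_s = 0;
};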
Even ignoring the staleness, this causes a data race and UB.
Thread 2 can read i, j, k while thread 1 is modifying them; you don't synchronize the access to those variables. If thread 2 doesn't respect the mutex g, there's no point in locking it in thread 1.
Yes, it can.
First of all, the lock guard does not have any effect on your code. A lock has to be used by at least two threads to have any effect.
Thread 2 can read at any moment. It can read an incremented i and not-yet-incremented j and k. In theory, it can even read a weird partial value obtained by reading in between the updates of the various bytes that compose i - for example, incrementing from 0xFF to 0x100 could be observed as 0x1FF or 0x0 - but not on x86, where these updates happen to be atomic.

Largest "atomic" Type?

When working in a concurrent, parallel programming language with multiple threads working together on multiple cores and/or multiple sockets, what is the largest value in memory considered to be atomic?
What I mean is: a string, being a sequence of bytes, is decidedly not atomic because a write to that location in memory may take some time to update the string. Therefore, a lock of some kind must be acquired when reading and writing to the string so that other threads don't see corrupted, half-finished results. However, a string on the stack is atomic because AFAIK the stack is not a shared area of memory across threads.
Is the largest guaranteed, lockless unit a bit or a byte, or does it depend on the instruction used to write that byte? Is it possible, for instance, for a thread to read an integer while another thread is moving the value bit-by-bit from a register to the stack to shared memory, causing the reader thread to see a half-written value?
I guess I am asking what the largest atomic value is on x86_64 and what the guarantees are.
The widest atomic instruction in x86-64 is lock cmpxchg16b, which reads and writes 16 bytes atomically.
Although it is usually used to atomically update a 16-byte object in memory, it can also be used to atomically read such a value.
To atomically update a value, load rdx:rax with the prior value and rcx:rbx with the new value. The instruction atomically updates the memory location with the new value only if the prior value hasn't changed.
To atomically read a 16-byte value, load rdx:rax and rcx:rbx with the same value. (It doesn't matter what value, but 0 is a good choice.) The instruction atomically reads the current value into rdx:rax.
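From C++ you would normally not write the instruction by hand; a 16-byte std::atomic is the portable way to ask for it (sketch below, with an invented Pair struct). Whether the compiler inlines lock cmpxchg16b or calls out to libatomic depends on the compiler and flags (e.g. gcc's -mcx16), so it is worth checking is_lock_free() to see what you actually got.

#include <atomic>
#include <cstdint>

struct Pair { std::uint64_t lo, hi; };   // 16 bytes

std::atomic<Pair> p{Pair{0, 0}};

// Reading may be implemented with lock cmpxchg16b: an atomic RMW
// that writes back the value it just read.
Pair read_pair()
{
    return p.load(std::memory_order_acquire);
}

// The classic 16-byte compare-and-swap.
bool update_pair(Pair expected, Pair desired)
{
    return p.compare_exchange_strong(expected, desired);
}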

Is x++; threadsafe?

If I update a variable in one thread like this:
receiveCounter++;
and then from another thread I only ever read this variable and write its value to a GUI.
Is that safe? Or could this instruction be interrupted in the middle, so the value in receiveCounter is wrong when it is read by another thread? It must be possible, right, since ++ is not atomic but several instructions?
I don't care about synchronizing reads and writes; the counter just needs to be incremented and then shown in the GUI, but those two things don't have to happen directly after each other.
What I care about is that the value cannot be wrong. Like the ++ operation being interrupted in the middle so the read value is completely off.
Do I need to lock this variable? I really do not want to, since it is updated very often. I guess I could solve this by posting a message to the MAIN thread and copying the value to a queue (which would then need to be locked, but I would not do that on every update).
But I am interested in the above problem anyway.
If one thread changes the value of a variable and another thread reads that value, and the program does not synchronize the accesses, it has a data race and the behavior of the program is undefined. Change the type of receiveCounter to std::atomic<int> (assuming it's an int to begin with).
At its core it is a read-modify-write operation, and those are not atomic. Some processors do have a dedicated instruction for it; the very common Intel/AMD cores, for example, have an INC instruction.
While that sounds like it could be atomic, since it is a single instruction, it still isn't. The x86/x64 instruction set no longer has much to do with the way the execution engine is actually implemented: it executes RISC-like "micro-ops", and the INC instruction is translated into multiple micro-ops. It can be made atomic with the LOCK prefix on the instruction, but compilers don't emit that unless they know an atomic update is desired.
You will thus need to be explicit about it. The C++11 std::atomic<> standard addition is a good way. Your operating system or compiler will also have intrinsics for it, usually named something like "Interlocked" or "__builtin".
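A minimal sketch of the std::atomic<> route for this case (assuming the counter is an int; the function names are invented). Relaxed ordering is enough here, since all you need is an untorn value that eventually shows up in the GUI:

#include <atomic>

std::atomic<int> receiveCounter{0};

// Receiving thread: the atomic RMW (a lock-prefixed add on x86)
// makes the whole increment indivisible.
void on_receive()
{
    receiveCounter.fetch_add(1, std::memory_order_relaxed);
}

// GUI thread: never sees a torn or "completely off" value,
// at worst a slightly stale one.
int current_count()
{
    return receiveCounter.load(std::memory_order_relaxed);
}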
Simple answer: NO.
Because i++; is the same as i = i + 1;, it consists of a load, a math operation, and a store of the value, so in general it is not atomic.
However, the operations actually executed depend on the instruction set of the CPU and might be atomic on some architectures. But the default is still not atomic.
Generally speaking, it is not thread-safe because the ++ operator consists of one read and one write, the pair of which is not atomic and can be interrupted in between.
It also depends on the language/compiler/architecture, but in the typical case the increment operation is probably not thread-safe.
As for the read and write operations themselves, as long as the variable is not a 64-bit value on a 32-bit platform, they should be atomic, so if you don't care about other threads seeing the value off by 1, it may be OK for your case.
No. The increment operation is not atomic, hence not thread-safe.
In your case it is safe to use this operation if you don't care about its value at a specific moment (and you only read this variable from the other thread, never write to it). It will eventually increment receiveCounter's value; you just don't have any guarantees about the order of operations.
++ is equivalent to i = i + 1.
++ is not an atomic operation, so it's NOT thread-safe. Only reads and writes of primitive variables (except long and double) are atomic and hence thread-safe.

Atomic 128-bit Memory Mode Selection

Using gcc, my code has an atomic 128-bit integer that is being written to by one thread and read concurrently by 31 threads. I don't care about this variable's synchronization with any other memory in my program (i.e. I'm okay with the compiler reordering two writes to two different integers), as long as reads and writes of this variable itself are consistent. I just want writes to the atomic 128-bit variable to be "eventually" reflected in the 31 threads reading from it.
Is it safe to use a relaxed memory model? What are the gotchas I should look out for?
Relaxed ordering does not formally guarantee when a value written by the writer thread becomes visible to any of the reader threads; the standard only says it should become visible in a finite period of time.
Strictly speaking, it would be conforming for the readers to keep seeing the initial value of the variable and none of the changes for a long time. However, the writer thread is always guaranteed to see at least the changes it made itself to the variable (and possibly, but again without any timing guarantee, later changes applied by another thread).
In other words: you still get a consistent view of this one variable within each single thread, but relaxed ordering gives you no synchronization with other memory operations between threads.
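A sketch of the setup described in the question, using GCC's unsigned __int128 extension (function names invented; depending on the target and flags, e.g. -mcx16, the operations may be inlined or go through libatomic, so checking is_lock_free() at startup is worthwhile):

#include <atomic>

// GCC extension; may need to link against libatomic (-latomic).
std::atomic<unsigned __int128> shared_value{0};

void writer(unsigned __int128 v)       // the single writer thread
{
    shared_value.store(v, std::memory_order_relaxed);
}

unsigned __int128 reader()             // called from the 31 reader threads
{
    return shared_value.load(std::memory_order_relaxed);
}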