Reason for C++ IntAtomicGet, GotW

In GotW #45, Herb Sutter states the following:
void String::AboutToModify(
    size_t n,
    bool bMarkUnshareable /* = false */
) {
    if( data_->refs > 1 && data_->refs != Unshareable ) {
        /* ... etc. ... */
This if-condition is not thread-safe. For one thing, evaluating even "data_->refs > 1" may not be atomic; if so, it's possible that if thread 1 tries to evaluate "data_->refs > 1" while thread 2 is updating the value of refs, the value read from data_->refs might be anything -- 1, 2, or even something that's neither the original value nor the new value.
Additionally, he points out that data_->refs may be modified in between comparing with 1 and comparing with Unshareable.
Further down, we find a solution:
void String::AboutToModify(
    size_t n,
    bool bMarkUnshareable /* = false */
) {
    int refs = IntAtomicGet( data_->refs );
    if( refs > 1 && refs != Unshareable ) {
        /* ... etc. ... */
Now, I understand that the same refs is used for both comparisons, which solves problem 2. But why the IntAtomicGet? My searches on the topic have turned up nothing: all the material on atomic operations focuses on read-modify-write operations, and here we just have a read. So can we just do...
int refs = data_->refs;
...which should probably just be one instruction in the end anyway?

Different platforms make different promises about the atomicity of read and write operations. x86, for example, guarantees that an aligned 4-byte (double-word) read is atomic. However, you cannot assume this holds for every architecture, and it probably won't.
If you plan to port your code to other platforms, such assumptions can get you into trouble and lead to strange race conditions. It is therefore better to protect yourself and make read/write operations explicitly atomic.
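In modern C++ you don't have to guess: std::atomic lets you ask whether atomic operations on a type are lock-free on the target platform. A small sketch of both queries (is_always_lock_free requires C++17):

#include <atomic>
#include <iostream>

int main() {
    // Compile-time guarantee (C++17): true if every atomic<int>
    // is lock-free on this platform.
    static_assert(std::atomic<int>::is_always_lock_free,
                  "atomic<int> needs locks on this platform");

    std::atomic<long long> big{0};
    // Run-time query: the answer may depend on alignment/architecture.
    std::cout << "atomic<long long> lock-free: "
              << big.is_lock_free() << '\n';
}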

Reading from shared memory (data_->refs) while another thread writes to it is the definition of a data race.
What happens when we non-atomically read from data_->refs while another thread is trying to write to it at the same time?
Imagine that thread A is doing ++data_->refs (a write) while thread B is doing int x = data_->refs (a read). Suppose thread B reads the first few bytes of data_->refs, thread A then finishes writing its value to data_->refs, and thread B finally reads the remaining bytes.
Thread B gets neither the original value nor the new value; it gets a completely different value! This scenario is just to illustrate what is meant by:
[...] the value read from data_->refs might be anything -- 1, 2, or even something that's neither the original value nor the new value.
The purpose of atomic operations is to ensure that an operation is indivisible: it is either observed as done or not done. Therefore, we use an atomic read operation to ensure that we get the value of data_->refs either before it is updated, or after (this depends on thread timings).
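In C++11 and later, the role of IntAtomicGet is played by std::atomic, whose load() is guaranteed to be atomic on every platform. A minimal sketch of how the check could look today (the Unshareable sentinel and the StringBuf layout are assumptions for illustration, not Herb's actual code):

#include <atomic>
#include <cstddef>

const int Unshareable = -1; // hypothetical sentinel value, for illustration

struct StringBuf {
    std::atomic<int> refs{0}; // reference count shared between threads
};

struct String {
    StringBuf* data_;

    void AboutToModify(std::size_t n, bool bMarkUnshareable = false) {
        // One atomic read; both comparisons then use the same snapshot.
        int refs = data_->refs.load();
        if (refs > 1 && refs != Unshareable) {
            /* ... copy the buffer, etc. ... */
        }
    }
};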

Related

Acq/rel memory order with 3 threads

Lately, the more I read about memory ordering in C++, the more confusing it gets. I hope you can help me clarify this (for purely theoretical purposes). Suppose I have the following code:
#include <atomic>
#include <cassert>

std::atomic<int> val = { 0 };
std::atomic<bool> f1 = { false };
std::atomic<bool> f2 = { false };

void thread_1() {
    f1.store(true, std::memory_order_relaxed);
    int v = 0;
    while (!val.compare_exchange_weak(v, v | 1,
                                      std::memory_order_release));
}

void thread_2() {
    f2.store(true, std::memory_order_relaxed);
    int v = 0;
    while (!val.compare_exchange_weak(v, v | 2,
                                      std::memory_order_release));
}

void thread_3() {
    auto v = val.load(std::memory_order_acquire);
    if (v & 1) assert(f1.load(std::memory_order_relaxed));
    if (v & 2) assert(f2.load(std::memory_order_relaxed));
}
The question is: can any of the assertions be false? On one hand, according to cppreference, std::memory_order_release forbids the stores in threads 1 and 2 from being reordered after the exchanges, and std::memory_order_acquire in thread 3 forbids both reads from being reordered before the first load. Thus, if thread 3 saw the first or the second bit set, the store to the corresponding boolean has already happened, and the assertion has to hold.
On the other hand, thread 3 synchronizes with whoever released the value it acquired from val. Can it happen (in theory, if not in practice) that thread 3 "acquired" the exchange "1 -> 3" by thread 2 (and therefore the f2 load returns true), but not the exchange "0 -> 1" by thread 1 (so the first assertion fires)? This possibility makes no sense to me under the "reordering" understanding, yet I can't find confirmation anywhere that it cannot happen.
Neither assertion can ever fail, thanks to ISO C++'s "release sequence" rules. This is the formalism that provides the guarantee you assumed must exist in your last paragraph.
The only stores to val are release-stores with the appropriate bits set, done after the corresponding store to f1 or f2. So if thread_3 sees a value with 1 bit set, it has definitely synchronized-with the writer that set the corresponding variable.
And crucially, they're each part of an RMW, and thus form a release-sequence that lets the acquire load in thread_3 synchronize-with both CAS writes, if it happens to see val == 3.
(Even a relaxed RMW can be part of a release-sequence, although in that case there wouldn't be a happens-before guarantee for stuff before the relaxed RMW, only for other release operations by this or other threads on this atomic variable. If thread_2 had used mo_relaxed, the assert on f2 could fail, but it still couldn't break things so the assert on f1 could ever fail. See also What does "release sequence" mean? and https://en.cppreference.com/w/cpp/atomic/memory_order)
If it helps, I think those CAS loops are fully equivalent to val.fetch_or(1, std::memory_order_release) (and fetch_or(2, ...) in thread_2). That's certainly how a compiler would implement fetch_or on a machine with CAS but no atomic OR primitive. IIRC, in the ISO C++ model, a CAS failure is only a load, not an RMW. Not that it matters; even a relaxed no-op RMW would still propagate a release sequence.
(Fun fact: x86 asm lock cmpxchg is always a real RMW, even on failure, at least on paper. But it's also a full barrier, so basically irrelevant to any reasoning about weakly-ordered RMWs.)
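To illustrate the equivalence mentioned above, here is thread_1 rewritten with fetch_or, reusing the declarations from the question (a sketch, not the original code); thread_2 would use 2 instead of 1:

void thread_1() {
    f1.store(true, std::memory_order_relaxed);
    // Atomically sets bit 0 of val; as a release RMW it joins the
    // release sequence on val exactly like the CAS loop it replaces.
    val.fetch_or(1, std::memory_order_release);
}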

Does this C++ sample code contain a data race?

Suppose there are no compiler reorderings.
#include <cstdint>

void do_something(int32_t); // defined elsewhere

int32_t value;
int32_t flag = 0;

// thread 1
void UpdateValue(int32_t x) {
    value = x;
    flag = 1;
}

// thread 2
void DoSomething() {
    while (flag == 0);
    do_something(value);
}
According to https://en.cppreference.com/w/cpp/language/memory_model, the evaluations flag = 1 and flag == 0 conflict.
And:
flag is not an atomic variable
there is no signal handler
flag = 1 does not happen-before flag == 0
So there is a data race?
But in this sample code, every read/write is atomic (4-byte aligned).
I can't find any undefined behavior, and I'm confused...
Yes: the data race here is UB, and you can expect any behavior, including the one you expect. Thinking about the order in which different threads may read and write that location helps in understanding why it is UB:
std::memory_order specifies how memory accesses, including regular, non-atomic memory accesses, are to be ordered around an atomic operation. Absent any constraints on a multi-core system, when multiple threads simultaneously read and write to several variables, one thread can observe the values change in an order different from the order another thread wrote them. Indeed, the apparent order of changes can even differ among multiple reader threads. Some similar effects can occur even on uniprocessor systems due to compiler transformations allowed by the memory model.
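For completeness, a minimal sketch (mine, not from the answer) of how the example becomes race-free once flag is atomic: the release store synchronizes with the acquire load, which also publishes the plain write to value:

#include <atomic>
#include <cstdint>

void do_something(int32_t); // defined elsewhere, as in the question

int32_t value;
std::atomic<int32_t> flag{0};

// thread 1
void UpdateValue(int32_t x) {
    value = x;                                // plain write...
    flag.store(1, std::memory_order_release); // ...published by the release store
}

// thread 2
void DoSomething() {
    while (flag.load(std::memory_order_acquire) == 0)
        ; // spin until thread 1's store is visible
    do_something(value); // guaranteed to read x; no data race
}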

Why does this cppreference excerpt seem to wrongly suggest that atomics can protect critical sections?

#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::vector<int> foo;
    std::atomic<int> bar{0};
    std::mutex mx;

    auto job = [&] {
        int asdf = bar.load();
        // std::lock_guard lg(mx);
        foo.emplace_back(1);
        bar.store(foo.size());
    };

    std::thread t1(job);
    std::thread t2(job);
    t1.join();
    t2.join();
}
This obviously is not guaranteed to work, but works with a mutex. But how can that be explained in terms of the formal definitions of the standard?
Consider this excerpt from cppreference:
If an atomic store in thread A is tagged memory_order_release and an atomic load in thread B from the same variable is tagged memory_order_acquire [as is the case with default atomics], all memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory.
Atomic loads and stores (with the default or with the specific acquire and release memory order specified) have the mentioned acquire-release semantics. (So does a mutex's lock and unlock.)
An interpretation of that wording could be that when Thread 2's load operation syncs with the store operation of Thread 1, Thread 2 is guaranteed to observe all (even non-atomic) writes that happened-before the store, such as the vector modification, making this well-defined. But pretty much everyone would agree that this can lead to a segmentation fault, and it surely would if the job function ran its three lines in a loop.
What standard wording explains the obvious difference in capability between the two tools, given that this wording seems to imply that atomics synchronize in a similar way?
I know when to use mutexes and atomics, and I know that the example doesn't work because no synchronization actually happens. My question is how the definition is to be interpreted so it doesn't contradict the way it works in reality.
The quoted passage means that when B loads the value that A stored, then by observing that the store happened, B can also be assured that everything A did before the store has also happened and is visible.
But this doesn't tell you anything if the store has not in fact happened yet!
The actual C++ standard says this more explicitly. (Always remember that cppreference, while a valuable resource which often quotes from or paraphrases the standard, is not the standard itself and is not authoritative.) From N4861, the final C++20 draft, we have in atomics.order p2:
An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A.
I would agree that if the load in your thread B returned 1, it could safely conclude that the other thread had finished its store and therefore had exited the critical section, and therefore B could safely use foo. In this case the load in B has synchronized with the store in A, since the value of the load (namely 1) came from the store (which is part of its own release sequence).
But it is entirely possible that both loads return 0, if both threads do their loads before either one does its store. The value 0 didn't come from either store, so the loads don't synchronize with the stores in that case. Your code doesn't even look at the value that was loaded, so both threads may enter the critical section together in that case.
The following code would be a safe, though inefficient, way to use an atomic to protect a critical section. It ensures that A will execute the critical section first, and B will wait until A has finished before proceeding. (Obviously if both threads wait for the other then you have a deadlock.)
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::vector<int> foo;
    std::atomic<int> bar{0};
    std::mutex mx;

    auto jobA = [&] {
        foo.emplace_back(1);
        bar.store(foo.size());
    };
    auto jobB = [&] {
        while (bar.load() == 0) /* spin */ ;
        foo.emplace_back(1);
    };

    std::thread t1(jobA);
    std::thread t2(jobB);
    t1.join();
    t2.join();
}
Setting aside the elephant in the room that none of the C++ containers are thread safe without employing locking of some sort (so forget about using emplace_back without implementing locking), and focusing on the question of why atomic objects alone are not sufficient:
You need more than atomic objects. You also need sequencing.
All that an atomic object gives you is that when an object changes state, any other thread will either see its old value or its new value, and it will never see any "partially old/partially new", or "intermediate" value.
But it makes no guarantee whatsoever as to when other execution threads will "see" the atomic object's new value. At some point they (hopefully) will see the atomic object instantly flip to its new value. When? Eventually. That's all you get from atomics.
One execution thread may very well set an atomic object to a new value, but other execution threads will still have the old value cached, in some form or fashion, and will continue to see the old value until some intermediate time passes (if ever).
Sequencing rules specify when objects' new values become visible in other execution threads. The simplest way to get both atomicity and easy-to-deal-with sequencing, in one fell swoop, is to use mutexes and condition variables, which handle all the hard details for you. You can still use atomics and, with careful logic, acquire/release fences to implement proper sequencing. But it's very easy to get wrong, and the worst of it is that you won't know it's wrong until your code starts going off the rails due to improper sequencing, and it will be nearly impossible to reproduce the faulty behavior for debugging purposes.
But for nearly all common, routine, garden-variety tasks, mutexes and condition variables are the simplest solution to proper inter-thread sequencing.
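As a sketch of that garden-variety approach (the names here are illustrative, not from the question): the mutex protects the shared data, and the condition variable supplies the sequencing, so the consumer provably sees the producer's writes:

#include <condition_variable>
#include <mutex>
#include <vector>

std::vector<int> foo;
std::mutex mx;
std::condition_variable cv;
bool ready = false;

void producer() {
    {
        std::lock_guard<std::mutex> lg(mx);
        foo.emplace_back(1); // mutation happens under the lock
        ready = true;
    }
    cv.notify_one(); // wake the consumer after releasing the lock
}

void consumer() {
    std::unique_lock<std::mutex> lk(mx);
    cv.wait(lk, [] { return ready; }); // sleeps until the producer is done
    // All of the producer's writes are now visible; reading foo is safe.
}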
The idea is that when Thread 2's load operation syncs with the store operation of Thread 1, it is guaranteed to observe all (even non-atomic) writes that happened-before the store, such as the vector modification
Yes, all writes done by foo.emplace_back(1); are guaranteed to be visible when bar.store(foo.size()); executes. But what guarantees that foo.emplace_back(1); in thread 1 sees a complete, consistent state from foo.emplace_back(1); executed in thread 2, and vice versa? Both calls read and modify the internal state of the std::vector, and there is no memory barrier before the code reaches the atomic store. And even if every individual variable were read and modified atomically, the std::vector's state consists of multiple variables: at least a size, a capacity, and a pointer to the data. Changes to all of them must be synchronized together, and a memory barrier alone is not enough for that.
To explain a little more, let's create a simplified example:
int a = 0;
int b = 0;
std::atomic<int> at;

// thread 1
int foo = at.load();
a = 1;
b = 2;
at.store(foo);

// thread 2
int foo = at.load();
int tmp1 = a;
int tmp2 = b;
at.store(tmp2);
Now you have two problems:
1. There is no guarantee that when tmp2 is 2, tmp1 is 1, because a and b are read before the atomic operation.
2. There is no guarantee that when at.store(tmp2) is executed, either a == b == 0 or a == 1 and b == 2 holds; it could be that a == 1 but still b == 0.
Is that clear?
But:
// thread 1
mutex.lock();
a = 1;
b = 2;
mutex.unlock();

// thread 2
mutex.lock();
int tmp1 = a;
int tmp2 = b;
mutex.unlock();
You either get tmp1 == 0 and tmp2 == 0, or tmp1 == 1 and tmp2 == 2. Do you see the difference?
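For contrast, a sketch (mine, not from the answer) of how the simplified example gets the same guarantee from atomics alone, by pairing a release store with an acquire load whose value is actually checked:

int a = 0;
int b = 0;
std::atomic<int> at{0};

// thread 1
void writer() {
    a = 1;
    b = 2;
    at.store(1, std::memory_order_release); // publish a and b
}

// thread 2
void reader() {
    while (at.load(std::memory_order_acquire) == 0)
        ; // wait until the release store is seen
    // Synchronizes-with the store: a == 1 and b == 2 are guaranteed here.
    int tmp1 = a;
    int tmp2 = b;
}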

About spin locks

I have some questions about this Boost spinlock code:
class spinlock
{
public:
    spinlock()
        : v_(0)
    {
    }

    bool try_lock()
    {
        long r = InterlockedExchange(&v_, 1);
        _ReadWriteBarrier(); // 1. What does this mean?
        return r == 0;
    }

    void lock()
    {
        for (unsigned k = 0; !try_lock(); ++k)
        {
            yield(k);
        }
    }

    void unlock()
    {
        _ReadWriteBarrier();
        *const_cast<long volatile*>(&v_) = 0;
        // 2. Why don't we need InterlockedExchange(&v_, 0)?
    }

private:
    long v_;
};
_ReadWriteBarrier() is a memory barrier, in this case for both reads and writes. On MSVC it is a compiler-level barrier: it prevents the compiler from moving or eliminating load and store operations (on x86, any operation with a memory operand) across it. In this particular case, it makes sure that the InterlockedExchange(&v_, 1) has completed before we continue into the critical section.
Because an InterlockedExchange would be less efficient: it requires more interaction with the other cores in the machine, to ensure that every other processor core has "let go" of the value. That makes no sense here, since (in correctly working code) we only unlock if we actually hold the lock, so no other processor will have a different value cached than the one we're writing over anyway. A volatile write to the memory is just as good.
The barriers are there to ensure memory synchronization; without them, different threads may see modifications of memory in different orders.
And the InterlockedExchange isn't necessary in the second case because we're not interested in the previous value. The role of InterlockedExchange is doubtlessly to set the value and return the previous value. (And why v_ would be long, when it can only take the values 0 and 1, is beyond me.)
There are three issues with atomic access to variables. First, ensuring that there is no thread switch in the middle of reading or writing a value; if this happens it's called "tearing"; the second thread can see a partly written value, which will usually be nonsensical. Second, ensuring that all processors see the change that is being made with a write, or that the processor reading a value sees any previous changes to that value; this is called "cache coherency". Third, ensuring that the compiler doesn't move code across the read or write; this is called "code motion". InterlockedExchange does the first two; although the MSDN documentation is rather muddled, _ReadWriteBarrier does the third, and possibly the second.
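For comparison, here is a portable C++11 sketch of the same idea using std::atomic_flag (my own sketch, not the Boost code): acquire ordering on lock and release ordering on unlock play the roles of the two _ReadWriteBarrier() calls, and the plain release store on unlock mirrors the volatile write:

#include <atomic>

class spinlock {
public:
    bool try_lock() {
        // Acquire ordering: operations inside the critical section
        // cannot be moved before a successful lock (the first barrier's job).
        return !v_.test_and_set(std::memory_order_acquire);
    }
    void lock() {
        while (!try_lock())
            ; // spin (a real implementation would yield or back off)
    }
    void unlock() {
        // Release ordering: writes made inside the critical section
        // cannot be moved after the unlock (the second barrier's job).
        v_.clear(std::memory_order_release);
    }
private:
    std::atomic_flag v_ = ATOMIC_FLAG_INIT;
};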

Does std::atomic prevent reordering of nonatomic variables over the atomic variables?

The question is rather simple:
If I have
Settings settings[N_STNGS]; // used by many threads
std::atomic<size_t> current_settings(0);

void updateSettings() // called by a single thread, always the same thread if that is important
{
    auto new_settings = (current_settings + 1) % N_STNGS;
    settings[new_settings].loadFromFileSystem(); // line A
    current_settings = new_settings;             // line B
}
does the standard guarantee that line A won't be reordered after line B? And will readers of settings always see consistent (committed, in the memory-visibility sense) data?
Edit: for multiple reader threads and nontrivial settings, is this worth the trouble compared to simple mutexing?
Given the definition
int settings[N_STNGS];
std::atomic<size_t> current_settings(0);
and Thread 1 executing:
settings[new_settings] = somevalue; // line A
current_settings=new_settings; // line B
and Thread 2 executing:
int cur_settings = current_settings; // line X
int setting_value = settings[cur_settings]; // line Y
then yes, if Thread 2 at line X reads new_settings written by Thread 1 in line B, and there are no other modifications to settings[new_settings] (by some code we don't see), Thread 2 is bound to read somevalue and no undefined behavior occurs. This is because all the operations are (by default) memory_order_seq_cst and a release-write (line B) synchronizes with an acquire-read (line X). Note that you need two statements in Thread 2 to get a sequenced-before relationship between the atomic read of the index and the read of the value (a memory_order_consume operation would do instead).
I'd certainly implement it with rw-mutexes for a start.
The general answer is no. If you are careful, use only functions that take a memory_order parameter, and pass them the right value for what you are doing, then it may be yes.
(And as others have pointed out, your code has problems. For instance, returning an atomic<> type by value doesn't make sense to me.)
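To make the first answer's reasoning concrete, here is a sketch with explicit memory orders (the reader function, the int element type, and the constants are illustrative assumptions, not from the question): the release store on line B pairs with an acquire load in any reader:

#include <atomic>
#include <cstddef>

constexpr std::size_t N_STNGS = 4;
int settings[N_STNGS]; // stand-in for the nontrivial Settings type
std::atomic<std::size_t> current_settings{0};

// single writer thread
void updateSettings(int somevalue) {
    auto new_settings =
        (current_settings.load(std::memory_order_relaxed) + 1) % N_STNGS;
    settings[new_settings] = somevalue;                               // line A
    current_settings.store(new_settings, std::memory_order_release); // line B
}

// any reader thread
int readSetting() {
    auto cur = current_settings.load(std::memory_order_acquire); // pairs with line B
    return settings[cur]; // sees the value written before the release store
}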