non-destructive atomic add?

non-destructive atomic add? - c++

If I have an atomic variable like so:
#include <atomic>
std::atomic<int> a = 5;
I'd like to atomically check whether (a + 4) is less than another variable, without over-writing the original value of a:
if(a.something(4) < another_variable){
//Do not want a to be incremented by 4 at this point
}
I did a quick test on atomic fetch_and_add() and ++ and they all seem to increase the value of variable a afterwards. Is there a way I can atomically increment to test, without the result being permanent?

if(a + 4 < another_variable) // ...
This is the best you can get with a single atomic. You are data-race free, as the reading of the atomic is safe against concurrent writes, and all subsequent operations happen on a copy of the original atomic value. A more verbose but functionally equivalent version would be:
int const copy_of_a = a.load();
if(copy_of_a + 4 < another_variable) // ...
This is also the best you can get in terms of synchronization. You may be worried about the fact that a may be changed on another thread to a value that will change the outcome of the if.
Assume there was a function that did the whole operation atomically:
if(a.plus4IsLessThan(another_variable) // ...
Then whether a concurrent change of a arrives in time to change the outcome of the test is still not known. You did not gain any additional guarantees in terms of synchronization.
If this is a problem for your program, it indicates that you are in need of a more powerful synchronization mechanism. Probably a std::mutex would be a good start.

You can just do:
if (a + 4 < another_variable) { ...
Which should be identical to:
if (a.load() + 4 < another_variable) { ...
By definition (§29.6.5/16-17, here A "refers to one of the atomic types" and "C refers to its corresponding non-atomic type"):
A::operator C() const volatile noexcept;
A::operator C() const noexcept;
Effects: load()
Returns: The result of load()
Neither of which modify a.

Related

Why does this cppreference excerpt seem to wrongly suggest that atomics can protect critical sections?

int main() {
std::vector<int> foo;
std::atomic<int> bar{0};
std::mutex mx;
auto job = [&] {
int asdf = bar.load();
// std::lock_guard lg(mx);
foo.emplace_back(1);
bar.store(foo.size());
};
std::thread t1(job);
std::thread t2(job);
t1.join();
t2.join();
}
This obviously is not guaranteed to work, but works with a mutex. But how can that be explained in terms of the formal definitions of the standard?
Consider this excerpt from cppreference:
If an atomic store in thread A is tagged memory_order_release and an
atomic load in thread B from the same variable is tagged
memory_order_acquire [as is the case with default atomics], all memory writes (non-atomic and relaxed
atomic) that happened-before the atomic store from the point of view
of thread A, become visible side-effects in thread B. That is, once
the atomic load is completed, thread B is guaranteed to see everything
thread A wrote to memory.
Atomic loads and stores (with the default or with the specific acquire and release memory order specified) have the mentioned acquire-release semantics. (So does a mutex's lock and unlock.)
An interpretation of that wording could be that when Thread 2's load operation syncs with the store operation of Thread1, it is guaranteed to observe all (even non-atomic) writes that happened-before the store, such as the vector-modification, making this well-defined. But pretty much everyone would agree that this can lead to a segmentation fault and would surely do so if the job function ran its three lines in a loop.
What standard wording explains the obvious difference in capability between the two tools, given that this wording seems to imply that atomic would synchronize in a way.
I know when to use mutexes and atomics, and I know that the example doesn't work because no synchronization actually happens. My question is how the definition is to be interpreted so it doesn't contradict the way it works in reality.

The quoted passage means that when B loads the value that A stored, then by observing that the store happened, it can also be assured that everything that B did before the store has also happened and is visible.
But this doesn't tell you anything if the store has not in fact happened yet!
The actual C++ standard says this more explicitly. (Always remember that cppreference, while a valuable resource which often quotes from or paraphrases the standard, is not the standard itself and is not authoritative.) From N4861, the final C++20 draft, we have in atomics.order p2:
An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic
operation B that performs an acquire operation on M and takes its value from any side effect in the release
sequence headed by A.
I would agree that if the load in your thread B returned 1, it could safely conclude that the other thread had finished its store and therefore had exited the critical section, and therefore B could safely use foo. In this case the load in B has synchronized with the store in A, since the value of the load (namely 1) came from the store (which is part of its own release sequence).
But it is entirely possible that both loads return 0, if both threads do their loads before either one does its store. The value 0 didn't come from either store, so the loads don't synchronize with the stores in that case. Your code doesn't even look at the value that was loaded, so both threads may enter the critical section together in that case.
The following code would be a safe, though inefficient, way to use an atomic to protect a critical section. It ensures that A will execute the critical section first, and B will wait until A has finished before proceeding. (Obviously if both threads wait for the other then you have a deadlock.)
int main() {
std::vector<int> foo;
std::atomic<int> bar{0};
std::mutex mx;
auto jobA = [&] {
foo.emplace_back(1);
bar.store(foo.size());
};
auto jobB = [&] {
while (bar.load() == 0) /* spin */ ;
foo.emplace_back(1);
};
std::thread t1(jobA);
std::thread t2(jobB);
t1.join();
t2.join();
}

Setting aside the elephant in the room that none of the C++ containers are thread safe without employing locking of some sort (so forget about using emplace_back without implementing locking), and focusing on the question of why atomic objects alone are not sufficient:
You need more than atomic objects. You also need sequencing.
All that an atomic object gives you is that when an object changes state, any other thread will either see its old value or its new value, and it will never see any "partially old/partially new", or "intermediate" value.
But it makes no guarantee whatsoever as to when other execution threads will "see" the atomic object's new value. At some point they (hopefully) will, see the atomic object's instantly flip to its new value. When? Eventually. That's all that you get from atomics.
One execution thread may very well set an atomic object to a new value, but other execution threads will still have the old value cached, in some form or fashion, and will continue to see the atomic object's old value, and won't "see" the atomic object's new value until some intermediate time passes (if ever).
Sequencing are rules that specify when objects' new values are visible in other execution threads. The simplest way to get both atomicity and easy to deal with sequencing, in one fell swoop, is to use mutexes and condition variables which handle all the hard details for you. You can still use atomics and with a careful logic use lock/release fence instructions to implement proper sequencing. But it's very easy to get it wrong, and the worst of it you won't know that it's wrong until your code starts going off the rails due to improper sequencing and it'll be nearly impossible to accurately reproduce the faulty behavior for debugging purposes.
But for nearly all common, routine, garden-variety tasks mutexes and condition variables is the most simplest solution to proper inter-thread sequencing.

The idea is that when Thread 2's load operation syncs with the store operation of Thread1, it is guaranteed to observe all (even non-atomic) writes that happened-before the store, such as the vector-modification
Yes all writes that done by foo.emplace_back(1); would be guaranteed when bar.store(foo.size()); is executed. But who guaranteed you that foo.emplace_back(1); from thread 1 would see any/all non partial consistent state from foo.emplace_back(1); executed in thread 2 and vice versa? They both read and modify internal state of std::vector and there is no memory barrier before code reaches atomic store. And even if all variables would be read/modified atomically std::vector state consists of multiple variables - size, capacity, pointer to the data at least. Changes to all of them must be synchronized as well and memory barrier is not enough for that.
To explain little more let's create simplified example:
int a = 0;
int b = 0;
std::atomic<int> at;
// thread 1
int foo = at.load();
a = 1;
b = 2;
at.store(foo);
// thread 2
int foo = at.load();
int tmp1 = a;
int tmp2 = b;
at.store(tmp2);
Now you have 2 problems:
There is no guarantee that when tmp2 value is 2
tmp1 value would be 1
as you read a and b before atomic operation.
There is no guarantee that when at.store(b)
is executed that either a == b == 0 or a == 1 and b == 2,
it could be a == 1 but still b == 0.
Is that clear?
But:
// thread 1
mutex.lock();
a = 1;
b = 2;
mutex.unlock();
// thread 2
mutex.lock();
int tmp1 = a;
int tmp2 = b;
mutex.unlock();
You either get tmp1 == 0 and tmp2 == 0 or tmp1 == 1 and tmp2 == 2, do you see the difference?

Difference between Interlocked, InterlockedAcquire, and InterlockedRelease if single thread reordering is impossible

In all likelihood, a lockless implementation is already overkill for the purposes of my application, but I wanted to look into memory barriers and lockless-ness anyways in case I ever actually need to use these concepts in the future.
From what I can tell:
an "InterlockedAcquire" function performs an atomic operation while preventing the compiler from moving code statements after the InterlockedAcquire to before the InterlockedAcquire.
an "InterlockedRelease" function performs an atomic operation while preventing the compiler from moving code statements before the InterlockedRelease to after the InterlockedRelease.
a vanilla "Interlocked" function performs an atomic operation while preventing the compiler from moving code statements in either direction across the Interlocked call.
My question is, if a function is structured such that the compiler can't reorder any of the code anyways because doing so would affect single-threaded behavior, is there a difference between any of the variants of an Interlocked function, or all they all effectively the same? Is the only difference between them how they interact with code reordering?
For a more concrete example, here's my current application - the produce() function as part of what will eventually be a multiple producer, single consumer queue built using a circular buffer:
template <typename T>
class Queue {
private:
long headIndex;
long tailIndex;
T* array[MAXQUEUESIZE];
public:
Queue() {
headIndex = 0;
tailIndex = 0;
memset(array, 0, MAXQUEUESIZE*sizeof(void*);
}
~Queue() {
}
bool produce(T value) {
//1) prevents concurrent calls to produce() from causing corruption:
long indexRetVal;
long reservedIndex;
do {
reservedIndex = tailIndex;
indexRetVal = InterlockedCompareExchange64(&tailIndex, (reservedIndex + 1) % MAXQUEUESIZE, reservedIndex);
} while (indexRetVal != reservedIndex);
//2) allocates the node.
T* newValPtr = (T*) malloc(sizeof(T));
if (newValPtr == null) {
OutputDebugString("Queue: malloc returned null");
return false;
}
*newValPtr = value;
//3) prevents a concurrent call to consume from causing corruption by atomically replacing the old pointer:
T* valPtrRetVal = InterlockedCompareExchangePointer(array + reservedIndex, newValPtr, null);
//if the previous value wasn't null, then our circular buffer overflowed:
if (valPtrRetVal != null) {
OutputDebugString("Queue: circular buffer overflowed");
free(newValPtr); //as pointed out by RbMm
return false;
}
//otherwise, everything worked fine
return true;
}
};
As I understand it, 3) will occur after 1) and 2) regardless of what I do anyways, but I should change 1) to an InterlockedRelease because I don't care whether it occurs before or after 2) and I should let the compiler decide.

My question is, if a function is structured such that the compiler can't reorder any of the code anyways because doing so would affect single-threaded behavior, is there a difference between any of the variants of an Interlocked function, or all they all effectively the same? Is the only difference between them how they interact with code reordering?
You may be confusing C++ statements with instructions. Your question isn't CPU specific, so you have to pretend you have no idea what the CPU instructions look like.
Consider this code:
if (a == 2)
{
b = 5;
}
Now, here's an example of a re-ordering of this code that doesn't affect a single thread:
int c = b;
b = 5;
if (a != 2)
b = c;
This performs the same operations but in a different order. It has no effect on single-threaded code. But, of course, if another thread was accessing b, it could see a value of 5 from this code even if a was never 2.
Thus it could also see a value of 5 from the original code even if a is never 2!
Why, because the two bits of code perform the same from the point of view of a single thread. And unless you use operations with guaranteed threading semantics, that's all the compiler, CPU, caches, and other platform components need to preserve.
So most likely, your belief that reordering any of the code would affect single-threaded behavior is probably incorrect. There's lots of ways to reorder and optimize code that doesn't affect single-threaded behavior.

There is an document on the msdn Explained the difference: Acquire and Release Semantics.
For the sample:
a++;
b++;
c++;
If we use acquire semantics to increment a, other processors would always see the increment of a before the increments of b and c;
If we use release semantics to increment c, other processors would always see the increments of a and b before the increment of c;
the InterlockedXxx routines perform, have both acquire and release semantics by default.
More specific, for 4 values:
a++;
b++;
c++;
d++;
If we use acquire semantics to increment b, other processors would always see the increment of b before the increments of c and d;
The order may be a->b->c,d or b->a,c,d.
If we use release semantics to increment c, other processors would always see the increments of a and b before the increment of c;
The order may be a,b->c->d or a,b,d->c.
To quote from this answer of #antiduh:
Acquire says "only worry about stuff after me". Release says "only
worry about stuff before me". Combining those both is a full memory
barrier.

All three versions prevent the compiler from moving code across the function call, but the compiler is not the only place that reordering takes place.
Modern CPUs have "out-of-order execution" and even "speculative execution". Acquire and release semantics cause the code to compiler to instructions with flags or prefixes controlling reordering within the CPU.

Race condition when incrementing and decrementing global variable in C++

I found an example of a race condition that I was able to reproduce under g++ in linux. What I don't understand is how the order of operations matter in this example.
int va = 0;
void fa() {
for (int i = 0; i < 10000; ++i)
++va;
}
void fb() {
for (int i = 0; i < 10000; ++i)
--va;
}
int main() {
std::thread a(fa);
std::thread b(fb);
a.join();
b.join();
std::cout << va;
}
I can undertand that the order matters if I had used va = va + 1; because then RHS va could have changed before getting back to assigned LHS va. Can someone clarify?

The standard says (quoting the latest draft):
[intro.races]
Two expression evaluations conflict if one of them modifies a memory location ([intro.memory]) and the other one reads or modifies the same memory location.
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below.
Any such data race results in undefined behavior.
Your example program has a data race, and the behaviour of the program is undefined.
What I don't understand is how the order of operations matter in this example.
The order of operations matters because the operations are not atomic, and they read and modify the same memory location.
can undertand that the order matters if I had used va = va + 1; because then RHS va could have changed before getting back to assigned LHS va
The same applies to the increment operator. The abstract machine will:
Read a value from memory
Increment the value
Write a value back to memory
There are multiple steps there that can interleave with operations in the other thread.
Even if there was a single operation per thread, there would be no guarantee of well defined behaviour unless those operations are atomic.
Note outside of the scope of C++: A CPU might have a single instruction for incrementing an integer in memory. For example, x86 has such instruction. It can be invoked both atomically and non-atomically. It would be wasteful for the compiler to use the atomic instruction unless you explicitly use atomic operations in C++.

First of all, this is undefined behaviour since the two threads' reads and writes of the same non-atomic variable va are potentially concurrent and neither happens before the other.
With that being said, if you want to understand what your computer is actually doing when this program is run, it may help to assume that ++va is the same as va = va + 1. In fact, the standard says they are identical, and the compiler will likely compile them identically. Since your program contains UB, the compiler is not required to do anything sensible like using an atomic increment instruction. If you wanted an atomic increment instruction, you should have made va atomic. Similarly, --va is the same as va = va - 1. So in practice, various results are possible.

The important idea here is that when c++ is compiled it is "translated" to assembly language. The translation of ++va or --va will result in assembly code that moves the value of va to a register, then stores the result of adding 1 to that register back to va in a separate instruction. In this way, it is exactly the same as va = va + 1;. It also means that the operation va++ is not necessarily atomic.
See here for an explanation of what the Assembly code for these instructions will look like.
In order to make atomic operations, the variable could use a locking mechanism. You can do this by declaring an atomic variable (which will handle synchronization of threads for you):
std::atomic<int> va;
Reference: https://en.cppreference.com/w/cpp/atomic/atomic

Multithreading: do I need protect my variable in read-only method?

I have few questions about using lock to protect my shared data structure. I am using C/C++/ObjC/Objc++
For example I have a counter class that used in multi-thread environment
class MyCounter {
private:
int counter;
std::mutex m;
public:
int getCount() const {
return counter;
}
void increase() {
std::lock_guard<std::mutex> lk(m);
counter++;
}
};
Do I need to use std::lock_guard<std::mutex> lk(m); in getCount() method to make it thread-safe?
What happen if there is only two threads: a reader thread and a writer thread then do I have to protect it at all? Because there is only one thread is modifying the variable so I think no lost update will happen.
If there are multiple writer/reader for a shared primitive type variable (e.g. int) what disaster may happen if I only lock in write method but not read method? Will 8bits type make any difference compare to 64bits type?
Is any primitive type are atomic by default? For example write to a char is always atomic? (I know this is true in Java but don't know about c++ and I am using llvm compiler on Mac if platform matters)

Yes, unless you can guarantee that changes to the underlying variable counter are atomic, you need the mutex.
Classic example, say counter is a two-byte value that's incremented in (non-atomic) stages:
(a) add 1 to lower byte
if lower byte is 0:
(b) add 1 to upper byte
and the initial value is 255.
If another thread comes in anywhere between the lower byte change a and the upper byte change b, it will read 0 rather than the correct 255 (pre-increment) or 256 (post-increment).
In terms of what data types are atomic, the latest C++ standard defines them in the <atomic> header.
If you don't have C++11 capabilities, then it's down to the implementation what types are atomic.

Yes, you would need to lock the read as well in this case.
There are several alternatives -- a lock is quite heavy here. Atomic operations are the most obvious (lock-free). There are also other approaches to locking in this design -- the read write lock is one example.

Yes, I believe that you do need to lock the read as well. But since you are using C++11 features, why don't you use std::atomic<int> counter; instead?

As a rule of thumb, you should lock the read too.
Read and write to int is atomic on most architecture (and since int is guaranted to be the machine's word size, you should almost never experience corrupted int)
Yet, the answer from #paxdiablo is correct, and will happen if you have someone doing this:
#pragma pack(push, 1)
struct MyObj
{
char a;
MyCounter cnt;
};
#pragma pack(pop)
In that specific case, cnt will not be aligned to a word boundary, and the int MyCounter::counter will/might be emulated in multiple operations in CPU supporting unaligned access (like x86). Thus, you could get this sequence of operations:
Thread A: [...] set counter to 255 (counter is 0x000000FF)
getCount() => CPU reads low byte: lo:255
<interrupted here>
Thread B: increase() => counter is incremented, leading to counter = 256 = 0x00000100)
<interrupted here>
Thread A: CPU read high bytes: 0x000001, concatenate: 0x000001FF, returns 511 !
Now, let's say you never use unaligned access. Yet, if you are doing something like this:
ThreadA.cpp:
int g = clientCounter.getCount();
while (g > 0)
{
processFirstClient();
g = clientCounter.getCount();
}
ThreadB.cpp:
if (acceptClient()) clientCounter.increase();
The compiler is completely allowed to replace the loop in Thread A by this:
if (clientCounter.getCount())
while(true) processFirstClient();
Why ? That's because for each instruction, the compiler will evaluate side-effects of such expression. The getCount() is so simple that the compiler will deduce: it's a read of a single variable, and it's not modified anywhere in ThreadA.cpp, thus, it's constant. Because it's constant, let's simplify this.
If you add a mutex, the mutex code will insert a memory barrier telling the compiler "hey, don't expect anything after this barrier is crossed".
Thus, the "optimization" above can not happen since getCount might have been modified.
Sure, you could have written volatile int counter instead of counter, and the compiler would have avoided this optimization too.
In the end, if you have to write a ton of code just to avoid a mutex, you're doing it wrong (and probably will get wrong results).

You cant gaurantee that multiple threads wont modify your variable at the same time. and if such a situation occurs your variable will be garbled or program might crash. In order to avoid such cases its always better and safer to make the program thread safe.
You can use the synchronization techinques available like: Mutex, Lock, Synchronization attribute(available for MS c++)

Does std::atomic prevent reordering of nonatomic variables over the atomic variables

question is rather simple Q:
If I have
settings[N_STNGS];//used by many threads
std::atomic<size_t> current_settings(0);
void updateSettings()//called by single thread , always the same thread if that is important
{
auto new_settings = (current_settings+1)%N_STNGS;
settings[new_settings].loadFromFileSystem(); //line A
current_settings=new_settings; //line B
}
does standard guarantee that line A wont be reordered after line B? Also will users of STNGS always see consistent(commited-as in memory visibility visible) data?
Edit: for multiple reader threads and nontrivial settings is this worth the trouble compared to simple mutexing?

Given the definition
int settings[N_STNGS];
std::atomic<size_t> current_settings(0);
and Thread 1 executing:
settings[new_settings] = somevalue; // line A
current_settings=new_settings; // line B
and Thread 2 executing:
int cur_settings = current_settings; // line X
int setting_value = settings[cur_settings]; // line Y
then yes, if Thread 2 at line X reads new_settings written by Thread 1 in line B, and there are no other modifications to settings[new_settings] (by some code we don't see), Thread 2 is bound to read somevalue and no undefined behavior occurs. This is because all the operations are (by default) memory_order_seq_cst and a release-write (line B) synchronizes with an acquire-read (line X). Note that you need two statements in Thread 2 to get a sequenced-before relationship between the atomic read of the index and the read of the value (a memory_order_consume operation would do instead).
I'd certainly implement it with rw-mutexes for start.

The general answer is no. If you are careful and you use only functions which have a memory_order parameter and pass them the right value for it depending on what you are doing, then it may be yes.
(And as other have pointed out, your code has problems. For instance, returning by value an atomic<> type doesn't make sense for me.)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js