How to use gcc atomic builtins? - c++

In my program I am updating a value from multiple threads. I know how to do this with a mutex (pthread_mutex_lock(), pthread_mutex_unlock()), but I just learned about the gcc atomic builtins and would like to give them a try.
shared value A;

void each_working_thread() {
    thread local variable B;
    if (is_valid(A, B))
        __sync_sub_and_fetch(A, B);
    else
        throw error;
}
where is_valid() is a const function that returns a boolean.
Is this correct, or could the value of A be updated by another thread during is_valid()?

I assume the result of is_valid(A, B) can change depending on A.
In that case, this is not safe - consider the following sequence:
Thread 1 calls is_valid(A, B) and gets true
Thread 2 calls is_valid(A, B) and gets true
Thread 1 calls __sync_sub_and_fetch(A, B), which changes A; with the new value of A, is_valid(A, B) is now false.
Thread 2 calls __sync_sub_and_fetch(A, B) even though it shouldn't, because is_valid(A, B) is now false.
You may be interested in the compare-exchange operation. In this case you can write something like:
int oldValueOfA, newValueOfA;

do {
    oldValueOfA = A;
    __sync_synchronize(); // memory barrier; makes sure A doesn't get accessed after this line
    if (!is_valid(oldValueOfA, B))
        throw error;
    newValueOfA = oldValueOfA - B;
} while (!__sync_bool_compare_and_swap(&A, oldValueOfA, newValueOfA));
You can't make the two operations atomic, but you can detect if they weren't atomic and then you can try again.
This works because if A doesn't still contain oldValueOfA, __sync_bool_compare_and_swap does nothing and returns false. Otherwise, it sets it to newValueOfA and returns true. So if it returns true, you know nobody else changed A while you were busy calling is_valid. And if it returns false, then you haven't actually done anything yet, so you're free to go back and try again.
Note that is_valid might be called several times; this is relevant if it has side effects (like printing a message on the screen). In that case you should just use a mutex.
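For reference, the same retry loop can be written with C++11 std::atomic, where compare_exchange_weak subsumes both the barrier and the compare-and-swap. A minimal sketch, assuming A is an int; the body of is_valid here is a hypothetical stand-in:

#include <atomic>
#include <stdexcept>

std::atomic<int> A{100};                        // shared value (int chosen for illustration)

bool is_valid(int a, int b) { return a >= b; }  // hypothetical stand-in for the real check

void each_working_thread(int B) {
    int oldValueOfA = A.load();
    do {
        if (!is_valid(oldValueOfA, B))
            throw std::runtime_error("invalid update");
        // On failure, compare_exchange_weak reloads the current value of A
        // into oldValueOfA, so the next iteration revalidates fresh data.
    } while (!A.compare_exchange_weak(oldValueOfA, oldValueOfA - B));
}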

That probably is not thread safe, depending on what exactly is_valid does.
The point of that specific builtin is just to perform an atomic subtraction.
Consider this code
A = A - B;
This is not thread safe, because it consists of multiple steps: reading A, computing A - B, and then writing the result back to A. Another thread can run between any of those steps. The builtin solves exactly that one scenario by performing the whole read-modify-write as a single atomic operation.
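A minimal sketch of that scenario, assuming A is a plain int shared between threads (note the builtin takes a pointer to the variable):

int A = 100; // shared

void subtract(int B) {
    // Performs A = A - B as one atomic read-modify-write
    // and returns the new value of A.
    int newA = __sync_sub_and_fetch(&A, B);
    (void)newA; // unused here; shown only for illustration
}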

Related

C++: __sync_synchronize() still needed with std::atomic?

I've been running into an infrequent but recurring race condition.
The program has two threads and uses std::atomic. I'll simplify the critical parts of the code to look like:
std::atomic<uint64_t> b; // flag, initialized to 0
uint64_t data[100]; // shared data, initialized to 0
thread 1 (publishing):
// set various shared variables here, for example
data[5] = 10;
uint64_t a = b.exchange(1); // signal to thread 2 that data is ready
thread 2 (receiving):
if (b.load() != 0) { // signal that data is ready
    // read various shared variables here, for example:
    uint64_t x = data[5];
    // race condition sometimes (x is sometimes not consistent)
}
The odd thing is that when I add __sync_synchronize() to each thread, the race condition goes away. I've seen this happen on two different servers.
That is, when I change the code to look like the following, the problem goes away:
thread 1 (publishing):
// set various shared variables here, for example
data[5] = 10;
__sync_synchronize();
uint64_t a = b.exchange(1); // signal to thread 2 that data is ready
thread 2 (receiving):
if (b.load() != 0) { // signal that data is ready
    __sync_synchronize();
    // read various shared variables here, for example:
    uint64_t x = data[5];
}
Why is __sync_synchronize() necessary? It seems redundant, as I thought both exchange and load ensured the correct sequential ordering.
Architecture: x86_64 processors, Linux, g++ 4.6.2.
Whilst it is impossible to say from your simplified code what actually goes on in your actual application, the fact that __sync_synchronize helps, and the fact that this function is a memory barrier, tells me that you are writing things in one thread that the other thread is reading, in a way that isn't atomic.
An example:
thread_1:

object *p = new object;
p->x = 1;
b.exchange(p); /* give pointer p to the other thread (here b would be a std::atomic<object*>) */

thread_2:

object *p = b.load();
if (p->x == 1)
    do_stuff();
else
    error("Huh?");
This may very well trigger the error path in thread_2, because the write to p->x has not actually been completed when thread 2 reads the new pointer value p.
Adding a memory barrier, in this case in the thread_1 code, should fix this. Note that for THIS case a memory barrier in thread_2 will not do anything - it may alter the timing and appear to fix the problem, but it won't be the right thing. You may still need memory barriers on both sides if you are reading and writing memory that is shared between two threads.
I understand that this may not be precisely what your code is doing, but the concept is the same - __sync_synchronize ensures that memory reads and memory writes have completed for ALL of the instructions before that function call [which isn't a real function call; it inlines a single instruction that waits for any pending memory operations to complete].
Noteworthy is that operations on std::atomic with relaxed memory ordering only affect the actual data stored in the atomic object, not reads/writes of other data. (The default sequentially consistent operations do order surrounding accesses as well, but an early implementation such as the one in g++ 4.6.2 may not honor that reliably.)
Sometimes you also need a "compiler barrier" to avoid the compiler moving stuff from one side of an operation to another:
std::atomic<bool> flag(false);

value = 42;
flag.store(true);

....

another thread:

while (!flag.load());
print(value);
Now, there is a chance that the compiler reorders the first thread's code as:
flag.store(true);
value = 42;
Now, that wouldn't be good, would it? std::atomic is guaranteed to be a "compiler barrier", but in other cases, the compiler may well shuffle stuff around in a similar way.
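For comparison, here is a sketch of the question's publishing pattern written with explicit std::atomic orderings and no __sync_synchronize; on a conforming compiler, the release store and acquire load are what order the surrounding plain accesses:

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> b{0};   // flag
uint64_t data[100];           // plain (non-atomic) shared data

void publisher() {            // thread 1
    data[5] = 10;                           // plain write
    b.store(1, std::memory_order_release);  // writes above cannot sink below this store
}

void receiver() {             // thread 2
    if (b.load(std::memory_order_acquire) != 0) { // reads below cannot hoist above this load
        uint64_t x = data[5]; // guaranteed to see 10 if the load observed 1
        (void)x;
    }
}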

Shared vector Synchronization with an atomic boolean

I have a shared vector that is accessed by two threads.
A function on thread A pushes into the vector, and a function on thread B swaps the vector out completely for processing.
MovetoVec(PInfo* pInfo)
{
    while (1)
    {
        if (GetSwitch())
        {
            swapBucket->push_back(pInfo);
            toggles = true;
            break;
        }
        else if (pInfo->tryMove == 5)
        {
            delete pInfo;
            break;
        }
        pInfo->tryMove++;
        Sleep(25);
    }
}
Thread A waits until the atomic boolean toggles is true and then pushes into the vector (the MovetoVec function above will be called by many threads). The function GetSwitch is defined as:
GetSwitch()
{
    if (toggles)
    {
        toggles = false;
        return TRUE;
    }
    else
        return FALSE;
}
toggles here is an atomic_bool. The other function, from thread B, that swaps the vector is:
GetClassObj(vector<ProfiledInfo*>* toSwaps)
{
    if (GetSwitch())
    {
        toSwaps->swap(*swapBucket);
        toggles = true;
    }
}
If GetSwitch returns false, thread B does nothing. I didn't use any locking here. It works in most cases, but sometimes one of the pInfo objects in swapBucket is NULL. I gather this is because of poor synchronization.
I followed this GetSwitch() logic just to avoid the overhead of locking. Should I drop it and go back to mutexes or critical sections?
Your GetSwitch implementation is wrong. It is possible for multiple threads to acquire the switch simultaneously.
An example of such a scenario with just two threads:
Thread 1 | Thread 2
--------------------------|--------------------------
if (toggles) |
| if (toggles)
toggles = false; |
| toggles = false;
The if-test and assignment are not an atomic operation and therefore cannot be used to synchronize threads on their own.
If you want to use an atomic boolean as a means of synchronization, you need to compare and exchange the value in one atomic operation. Luckily, C++ provides exactly such an operation, std::atomic::compare_exchange, which is available in a weak and a strong flavor (the weak one may fail spuriously but is cheaper when called in a loop).
Using this operation, your GetSwitch method would become:
bool GetSwitch()
{
    bool expected = true;  // the value we expect 'toggles' to have
    bool desired = false;  // the value we want 'toggles' to get
    // Check if 'toggles' is as expected, and if it is, update it to the desired value.
    // Note: compare_exchange_strong takes 'expected' by reference, not by pointer.
    bool result = toggles.compare_exchange_strong(expected, desired);
    // The result of the compare_exchange is true if the value was updated and false if it was not
    return result;
}
This will ensure that comparing and updating the value happens atomically.
Note that the C++ standard does not guarantee an atomic boolean to be lock-free. In your case, you could also use std::atomic_flag, which is guaranteed to be lock-free by the standard! Read up on it carefully though; it works a tad differently than atomic variables.
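For illustration, a sketch of the switch built on std::atomic_flag; note the inverted sense (the flag being set means the switch is currently taken), and the names here are hypothetical:

#include <atomic>

std::atomic_flag taken = ATOMIC_FLAG_INIT; // set = the switch is held

bool GetSwitch()
{
    // test_and_set returns the previous state: false means the flag
    // was clear and we just acquired the switch in one atomic step.
    return !taken.test_and_set(std::memory_order_acquire);
}

void ReleaseSwitch() // replaces the plain 'toggles = true'
{
    taken.clear(std::memory_order_release); // hand the switch back
}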
Writing lock-free code, as you are attempting to do, is quite complex and error-prone.
My advice would be to write the code with locks first and ensure it is 100% correct. Mutexes are actually surprisingly fast, so performance should be okay in most cases. A good read on lock performance: http://preshing.com/20111118/locks-arent-slow-lock-contention-is
Only once you have profiled your code and convinced yourself that the locks are hurting performance should you attempt to write the code lock-free. Then profile again, because lock-free code is not necessarily faster.
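As a baseline, here is a sketch of the push/swap pattern with a plain mutex; the tryMove retry logic is omitted, and PInfo/swapBucket are assumed to be declared as in the question:

#include <mutex>
#include <vector>

struct PInfo;                     // as in the question
std::mutex bucketMutex;
std::vector<PInfo*>* swapBucket;  // as in the question

void MovetoVec(PInfo* pInfo)
{
    std::lock_guard<std::mutex> lock(bucketMutex);
    swapBucket->push_back(pInfo); // push and swap can never interleave now
}

void GetClassObj(std::vector<PInfo*>* toSwaps)
{
    std::lock_guard<std::mutex> lock(bucketMutex);
    toSwaps->swap(*swapBucket);
}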

Is a race condition possible when only one thread writes to a bool variable in c++?

In the following code example, program execution never ends.
It creates a thread which waits for a global bool to be set to true before terminating. There is only one writer and one reader. I believe that the only situation that allows the loop to continue running is if the bool variable is false.
How is it possible that the bool variable ends up in an inconsistent state with just one writer?
#include <iostream>
#include <pthread.h>
#include <unistd.h>

bool done = false;

void *threadfunc1(void *) {
    std::cout << "t1:start" << std::endl;
    while (!done);
    std::cout << "t1:done" << std::endl;
    return NULL;
}

int main()
{
    pthread_t threads;
    pthread_create(&threads, NULL, threadfunc1, NULL);
    sleep(1);
    done = true;
    std::cout << "done set to true" << std::endl;
    pthread_exit(NULL);
    return 0;
}
There's a problem in the sense that this statement in threadfunc1():
while(!done);
can be implemented by the compiler as something like:
a_register = done;
label:
if (a_register == 0) goto label;
So updates to done will never be seen.
There is really nothing that prevents the compiler from optimizing the while-loop away. Use an atomic or a mutex to access the bool from more than one thread; that is the only supported and correct solution. As you are using POSIX, a mutex would be the right solution in this case.
And don't use volatile. The POSIX standard states what has to work, and volatile is not a solution that is guaranteed to work.
And there is another problem: there is no guarantee that your newly created thread ever started to run before you set the flag to true.
For such a simple example volatile is enough, but for the vast majority of real-world situations it is not. Use a condition variable for this task; they look weird at first glance but are actually quite logical. On x86, a bool IS atomic to read/write (on ARM, probably not). Also beware the obstacle with vector<bool>: it is NOT a vector of bools, it is a bitfield. To write to such a vector from several threads, use a vector of another type such as vector<char> (or a plain bool arr[SIZE]).
Also, you don't join the thread, which is wrong.
Race condition means: two threads are accessing the same object, and at least one of the accesses is a write.
That gives two kinds of races: write-write conflicts and write-read conflicts.
Back to your code: you essentially have two threads, the main thread and the one you created with pthread_create.
One of them performs a read, while(!done), and the other performs a write, done = true.
You have race condition for sure.
Is a race condition possible when only one thread writes to a bool variable in c++?
Yes. In your case, the main thread is also a thread (i.e. you have one thread writing and one thread reading).
How is it possible that the bool variable ends up in an inconsistent state with just one writer?
The compiler is (or should be) an optimizing compiler. It will probably optimize away the repeated reads of the done variable, unless you take care to avoid that (use std::atomic<bool> done instead).
It's not guaranteed that assignment to a bool (which is one byte) is atomic.
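Putting the answers together, a sketch of the corrected program using std::atomic<bool> and joining the thread:

#include <atomic>
#include <iostream>
#include <pthread.h>
#include <unistd.h>

std::atomic<bool> done(false);

void *threadfunc1(void *) {
    std::cout << "t1:start" << std::endl;
    while (!done.load());  // an atomic load cannot be optimized away
    std::cout << "t1:done" << std::endl;
    return NULL;
}

int main()
{
    pthread_t thread;
    pthread_create(&thread, NULL, threadfunc1, NULL);
    sleep(1);
    done.store(true);
    std::cout << "done set to true" << std::endl;
    pthread_join(thread, NULL);  // wait for the thread instead of pthread_exit
    return 0;
}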

Do I need to use volatile keyword if I declare a variable between mutexes and return it?

Let's say I have the following function.
std::mutex mutex;

int getNumber()
{
    mutex.lock();
    int size = someVector.size();
    mutex.unlock();
    return size;
}
Is this a place to use the volatile keyword when declaring size? Will return value optimization or something else break this code if I don't use volatile? The size of someVector can be changed from any of the numerous threads the program has, and it is assumed that only one thread (other than the modifiers) calls getNumber().
No. But beware that the returned size may not reflect the actual size AFTER the mutex is released.
Edit: If you need to do some work that relies on the size being correct, you will need to wrap that whole task with a mutex.
You haven't mentioned what the type of the mutex variable is, but assuming it is a std::mutex (or something similar meant to guarantee mutual exclusion), the compiler is prevented from performing a lot of optimizations. So you don't need to worry about return value optimization or some other optimization allowing the size() query to be performed outside of the mutex block.
However, as soon as the mutex lock is released, another waiting thread is free to access the vector and possibly mutate it, thus changing the size. Now, the number returned by your function is outdated. As Mats Petersson mentions in his answer, if this is an issue, then the mutex lock needs to be acquired by the caller of getNumber(), and held until the caller is done using the result. This will ensure that the vector's size does not change during the operation.
Explicitly calling mutex::lock followed by mutex::unlock quickly becomes infeasible for more complicated functions involving exceptions, multiple return statements, etc. A much easier alternative is to use std::lock_guard to acquire the mutex lock:
int getNumber()
{
    std::lock_guard<std::mutex> l(mutex); // lock is acquired
    int size = someVector.size();
    return size;
} // lock is released automatically when l goes out of scope
volatile is a keyword that tells the compiler to literally perform every read and write of the variable and not apply any optimizations to them. Here is an example:
int example_function() {
    int a;
    volatile int b;
    a = 1; // this write is ignored because nothing reads it before it is assigned again
    a = 2; // same here
    a = 3; // this is the last one, so a write takes place
    b = 1; // b gets written here, because b is volatile
    b = 2; // and again
    b = 3; // and again
    return a + b;
}
What is the real use of this? I've seen it in delay functions (keeping the CPU busy for a bit by making it count up to a number) and in systems where several threads might look at the same variable. It can sometimes help a bit with multi-threaded things, but it isn't really a threading tool and is certainly not a silver bullet.

Can I switch the test and modification part in wait/signal semaphore?

The classic non-busy-waiting version of the wait() and signal() semaphore operations is implemented as below. In this version, value can be negative.
// primitive
wait(semaphore* S)
{
    S->value--;
    if (S->value < 0)
    {
        add this process to S->list;
        block();
    }
}

// primitive
signal(semaphore* S)
{
    S->value++;
    if (S->value <= 0)
    {
        remove a process P from S->list;
        wakeup(P);
    }
}
Question: Is the following version also correct? Here I test first and then modify the value. It would be great if you could show me a scenario where it doesn't work.
// primitive wait().
// If (S->value > 0), the whole function is atomic;
// otherwise, only the if () {} section is atomic.
wait(semaphore* S)
{
    if (S->value <= 0)
    {
        add this process to S->list;
        block();
    }
    // here I decrement the value after the previous test and possible blocking
    S->value--;
}

// similar to wait()
signal(semaphore* S)
{
    if (S->list is not empty)
    {
        remove a process P from S->list;
        wakeup(P);
    }
    // here I increment the value after the previous test and possible waking up
    S->value++;
}
Edit: My motivation is to figure out whether I can use this latter version to achieve mutual exclusion with no deadlock and no starvation.
Your modified version introduces a race condition:
Thread A: if (S->value <= 0)  // value = 1, test fails, so no blocking
Thread B: if (S->value <= 0)  // value = 1, test fails, so no blocking
Thread A: S->value--;         // value = 0
Thread B: S->value--;         // value = -1
Both threads have acquired a count=1 semaphore. Oops. Note that there's another problem even if they're non-preemptible (see below), but for completeness, here's a discussion on atomicity and how real locking protocols work.
When working with protocols like this, it's very important to nail down exactly what atomic primitives you are using. Atomic primitives are such that they seem to execute instantaneously, without being interleaved with any other operations. You cannot just take a big function and call it atomic; you have to make it atomic somehow, using other atomic primitives.
Most CPUs offer a primitive called 'atomic compare and exchange'. I'll abbreviate it cmpxchg from here on. The semantics are like so:
bool cmpxchg(long *ptr, long oldval, long newval) {
    if (*ptr == oldval) {
        *ptr = newval;
        return true;
    } else {
        return false;
    }
}
cmpxchg is not implemented with this code. It is in the CPU hardware, but behaves a bit like this, only atomically.
Now, let's add to this some additional helpful functions (built out of other primitives):
add_waitqueue(waitqueue) - Sets our process state to sleeping and adds us to a wait queue, but continues executing (ATOMIC)
schedule() - Switch threads. If we're in a sleeping state, we don't run again until awakened (BLOCKING)
remove_waitqueue(waitqueue) - removes our process from a wait queue, then sets our state to awakened if it isn't already (ATOMIC)
memory_barrier() - ensures that any reads/writes logically before this point actually are performed before this point, avoiding nasty memory ordering issues (we'll assume all other atomic primitives come with a free memory barrier, although this isn't always true) (CPU/COMPILER PRIMITIVE)
Here's how a typical semaphore acquisition routine will look. It's a bit more complex than your example, because I've explicitly nailed down what atomic operations I'm using:
void sem_down(sem *pSem)
{
    while (1) {
        long spec_count = pSem->count;
        read_memory_barrier(); // make sure spec_count doesn't start changing on us! pSem->count may keep changing though
        if (spec_count > 0)
        {
            if (cmpxchg(&pSem->count, spec_count, spec_count - 1)) // ATOMIC
                return;   // got the semaphore without blocking
            else
                continue; // count is stale, try again
        } else { // semaphore count is zero
            add_waitqueue(pSem->wqueue); // ATOMIC
            // recheck the semaphore count, now that we're in the waitqueue - it may have changed
            if (pSem->count == 0) schedule(); // NOT ATOMIC
            remove_waitqueue(pSem->wqueue); // ATOMIC
            // loop around again to try to acquire the semaphore
        }
    }
}
You'll note that the actual test for a non-zero pSem->count, in a real-world semaphore_down function, is accomplished by cmpxchg. You can't trust any other read; the value can change an instant after you read the value. We simply can't separate the value check and the value modification.
The spec_count here is speculative. This is important. I'm essentially making a guess at what the count will be. It's a pretty good guess, but it's a guess. cmpxchg will fail if my guess is wrong, at which point the routine has to loop and try again. If I guess 0, then I will either be woken up (as it ceases to be zero while I'm on the waitqueue), or I will notice it's not zero anymore in the schedule test.
You should also note that there is no possible way to make a function that contains a blocking operation atomic. It's nonsensical. Atomic functions, by definition, appear to execute instantaneously, not interleaved with anything else whatsoever. But a blocking function, by definition, waits for something else to happen. This is inconsistent. Likewise, no atomic operation can be 'split up' across a blocking operation, which it is in your example.
Now, you could do away with a lot of this complexity by declaring the function non-preemptable. By using locks or other methods, you simply ensure only one thread is ever running (not including blocking of course) in the semaphore code at a time. But a problem still remains then. Start with a value of 0, where C has taken the semaphore down twice, then:
Thread A: if (S->value <= 0)  // value = 0
Thread A: Block....
Thread B: if (S->value <= 0)  // value = 0
Thread B: Block....
Thread C: S->value++          // value = 1
Thread C: Wakeup(A)
(Thread C calls signal() again)
Thread C: S->value++          // value = 2
Thread C: Wakeup(B)
(Thread C calls wait())
Thread C: if (S->value <= 0)  // value = 2, test fails, so no blocking
Thread C: S->value--          // value = 1
// A and B have been woken
Thread A: S->value--          // value = 0
Thread B: S->value--          // value = -1
You could probably fix this with a loop to recheck S->value - again, assuming you are on a single-processor machine and your semaphore code is non-preemptible. Unfortunately, these assumptions are false on all desktop OSes :)
For more discussion on how real locking protocols work, you might be interested in the paper "Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux"