Is std::call_once lock free? - c++

I want to find out whether std::call_once is lock-free. There are call_once implementations that use a mutex, but why should we need a mutex? I tried to write a simple implementation using atomic<bool> and a CAS operation. Is this code thread safe?
#include <iostream>
#include <thread>
#include <functional>
#include <atomic>
#include <unistd.h>

using namespace std;

using my_once_flag = atomic<bool>;

void my_call_once(my_once_flag& flag, std::function<void()> foo) {
    bool expected = false;
    bool res = flag.compare_exchange_strong(expected, true,
                                            std::memory_order_release,
                                            std::memory_order_relaxed);
    if(res)
        foo();
}
my_once_flag flag;

void printOnce() {
    usleep(100);
    my_call_once(flag, [](){
        cout << "test" << endl;
    });
}

int main() {
    for(int i = 0; i < 500; ++i){
        thread([](){
            printOnce();
        }).detach();
    }
    return 0;
}

Your proposed implementation is not thread safe. It does guarantee that foo() will only be called once through this code, but it does not guarantee that all threads will see the side effects of the call to foo(). Suppose that thread 1 executes the compare-exchange and gets true, then the scheduler switches to thread 2 before thread 1 calls foo(). Thread 2 will get false, skip the call to foo(), and move on. Since the call to foo() has not been executed, thread 2 can continue executing before any side effects from foo() have occurred.
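For contrast, the standard facility does not have this problem: std::call_once blocks concurrent callers until the callable has returned, so any thread that gets past it is guaranteed to see the call's side effects. A minimal usage sketch:

#include <iostream>
#include <mutex>

std::once_flag once;

void printOnce() {
    // Every thread reaching this line either runs the lambda itself or
    // waits until the winning thread's lambda has finished.
    std::call_once(once, []{ std::cout << "test" << std::endl; });
}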

The already-called-once fast-path can be wait-free.
gcc's implementation doesn't look all that efficient. I don't know why it isn't implemented the same way as initialization of static local variables with a non-constant arg, which uses a check that's very cheap (but not free!) for the case where it's already initialized.
http://en.cppreference.com/w/cpp/thread/call_once comments that:
Initialization of function-local statics is guaranteed to occur only
once even when called from multiple threads, and may be more efficient
than the equivalent code using std::call_once.
To make an efficient implementation, the std::once_flag could have three states:
execution finished: if you find this state, you're already done.
execution in progress: if you find this, wait until it changes to finished (or changes to failed-with-exception, in which case try to claim it)
execution not started: if you find this, attempt to CAS it to in-progress and call the function. If the CAS failed, some other thread succeeded, so goto the wait-for-finished state.
Checking the flag with an acquire-load is extremely cheap on most architectures (especially x86 where all loads are acquire-loads). Once it's set to "finished", it's not modified for the rest of the program, so it can stay cached in L1 on all cores (unless you put it in the same cache line as something that is frequently modified, creating false sharing).
Even if your implementation worked, it attempts an atomic CAS every time, which is ridiculously more expensive than a load-acquire.
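Here is a minimal sketch of that three-state design. The names and the spin-wait are my own invention; a real implementation would block rather than spin:

#include <atomic>

enum class once_state { not_started, in_progress, finished };
using my_once_flag3 = std::atomic<once_state>;
// initialize as: my_once_flag3 flag{once_state::not_started};

template <class F>
void my_call_once3(my_once_flag3& flag, F f) {
    // Fast path: a single acquire load, cheap once the state is "finished".
    if (flag.load(std::memory_order_acquire) == once_state::finished)
        return;
    once_state expected = once_state::not_started;
    while (!flag.compare_exchange_strong(expected, once_state::in_progress,
                                         std::memory_order_acquire)) {
        if (expected == once_state::finished)
            return;                            // another thread finished the call
        expected = once_state::not_started;    // in progress: spin and retry
    }
    try {
        f();
        flag.store(once_state::finished, std::memory_order_release);
    } catch (...) {
        // Failed with exception: reset so another thread can claim it.
        flag.store(once_state::not_started, std::memory_order_release);
        throw;
    }
}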
I haven't fully decoded what gcc is doing for call_once, but it unconditionally does a bunch of loads, and two stores to thread-local storage, before checking if a pointer is NULL. (test rax,rax / je). But if it is, it calls std::__throw_system_error(int), so that's not a guard variable that it's using to detect the already-initialized case.
So it looks like it unconditionally calls __gthrw_pthread_once(int*, void (*)()), and checks the return value. So that pretty much sucks for use-cases where you want to cheaply make sure some initialization got done, while avoiding the static-initialization fiasco. (i.e. that your build procedure controls the ordering of constructors for static objects, not anything you put in the code itself.)
So I'd recommend using static int dummy = init_function(); where dummy is either something you actually want to construct, or just a way to call init_function for its side-effects.
Then on the fast-path, the asm from:
int called_once();
void static_local(){
    static char dummy = called_once();
    (void)dummy;
}
looks like this:
static_local():
    movzx eax, BYTE PTR guard variable for static_local()::dummy[rip]
    test al, al
    je .L18
    ret
.L18:
    ... # code that implements basically what I described above: call or wait
See it on the Godbolt compiler explorer, along with gcc's actual code for std::once_flag.
You could of course implement a guard variable yourself with an atomic uint8_t, which starts out initialized to non-zero, and is set to zero only when the call is complete. Testing for zero may be slightly cheaper on some ISAs, including x86 if the compiler is weird like gcc and decides to actually load it into a register instead of using cmp byte [guard], 0.
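A rough sketch of that hand-rolled guard; only the fast path is shown, and the slow-path helper is hypothetical:

#include <atomic>
#include <cstdint>

std::atomic<uint8_t> guard{1};   // non-zero = not yet initialized

void init_slow_path();           // hypothetical: CAS to in-progress, run the init,
                                 // then guard.store(0, std::memory_order_release)

void ensure_initialized() {
    // Fast path: one acquire load, compilable to "cmp byte [guard], 0".
    if (guard.load(std::memory_order_acquire) != 0)
        init_slow_path();
}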

Related

c++ atomic: would function call act as memory barrier?

I'm reading the article Memory Ordering at Compile Time, which says:
In fact, the majority of function calls act as compiler barriers,
whether they contain their own compiler barrier or not. This excludes inline functions, functions declared with the pure attribute, and cases where link-time code generation is used. Other than those cases, a call to an external function is even stronger than a compiler barrier, since the compiler has no idea what the function's side effects will be.
Is this a true statement? Think about this sample -
#include <atomic>
#include <cassert>

std::atomic<bool> flag{false};
int value = 0;

void th1 () { // running in thread 1
    value = 1;
    // use atomic & release to prevent the statement above being reordered below
    flag.store(true, std::memory_order_release);
}

void th2 () { // running in thread 2
    // use atomic & acquire to prevent assert(..) being reordered above
    while (!flag.load(std::memory_order_acquire)) {}
    assert (value == 1); // should never fail!
}
Then suppose we remove the atomic and replace it with a plain function call -
bool flag = false;
int value = 0;

void writeflag () {
    flag = true;
}

void readflag () {
    while (!flag) {}
}

void th1 () {
    value = 1;
    writeflag(); // would the function call prevent reordering?
}

void th2 () {
    readflag(); // would the function call prevent reordering?
    assert (value == 1); // would this fail???
}
Any idea?
A compiler barrier is not the same thing as a memory barrier. A compiler barrier prevents the compiler from moving code across the barrier. A memory barrier (loosely speaking) prevents the hardware from moving reads and writes across the barrier. For atomics you need both, and you also need to ensure that values don't get torn when read or written.
Formally, no, if only because Link-Time Code Generation is a valid implementation choice and need not be optional.
There's also a second oversight, and that's escape analysis. The claim is that "the compiler has no idea what the function’s side effects will be.", but if no pointers to my local variables escape from my function, then the compiler does know for sure that no other function changes them.
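A small illustration of that point (a sketch; external_call is a stand-in for any opaque function defined in another translation unit):

void external_call();   // compiler cannot see its body

int no_escape() {
    int local = 1;
    // No pointer or reference to 'local' ever escapes this function, so the
    // compiler may keep it in a register across the call and fold the whole
    // function to 'return 1;', despite the "compiler barrier" of the call.
    external_call();
    return local;
}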
In the second example, even if we assume no reordering of any kind, the behavior is undefined.
The writes and reads of the variable flag are not atomic, and there is a race condition1. The absence of reordering doesn't guarantee that both threads don't access the variable flag at the same time: this happens when one thread hits the while loop in readflag and reads flag while the other thread writes to flag in writeflag.
1 (Quoted from ISO/IEC 14882:2011(E), 1.10 Multi-threaded executions and data races, paragraph 21)
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
You are confusing a memory barrier, used for inter-thread memory visibility, with a compiler barrier, which isn't a threading device at all, just a device (or trick) to prevent reordering of side effects by the compiler.
You need a memory barrier for your threading example.
You can use a compiler barrier to ensure that memory side effects are performed in a given order (on the local CPU) for other purposes, like benchmarking, getting around a type-aliasing violation, integrating assembly code, or signal handling (for a signal only handled in that same thread).
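To make the distinction concrete, here is a sketch contrasting the two (the inline-asm form is GCC/Clang syntax):

#include <atomic>

int value = 0;

void barriers_demo() {
    value = 1;
    asm volatile("" ::: "memory");   // compiler barrier only: no fence
                                     // instruction is emitted for the CPU
    std::atomic_signal_fence(std::memory_order_seq_cst); // portable compiler barrier
    value = 2;
    std::atomic_thread_fence(std::memory_order_seq_cst); // real memory barrier:
                                     // e.g. an MFENCE instruction on x86
}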

C++: Thread Safety in a Signal/Slot Library

I'm implementing a Signal/Slot framework, and got to the point that I want it to be thread-safe. I already had a lot of support from the Boost mailing-list, but since this is not really boost-related, I'll ask my pending question here.
When is a signal/slot implementation (or any framework that calls functions outside itself, specified in some way by the user) considered thread-safe? Should it be safe w.r.t. its own data, i.e. the data associated to its implementation details? Or should it also take into account the user's data, which might or might not be modified whatever functions are passed to the framework?
This is an example given on the mailing-list (Edit: this is an example use-case --i.e. user code--. My code is behind the calls to the Emitter object):
int * somePtr = nullptr;
Emitter<Event> em; // just an object that can emit the 'Event' signal

void mainThread()
{
    em.connect<Event>(someFunction);
    // now, somehow, 2 threads are created which, at some point
    // execute the thread1() and thread2() functions below
}

void someFunction()
{
    // can somePtr change after the check but before the set?
    if (somePtr)
        *somePtr = 17;
}

void cleanupPtr()
{
    // this looks safe, but compilers and CPUs can reorder this code:
    int *tmp = somePtr;
    somePtr = nullptr;
    delete tmp;
}

void thread1()
{
    em.emit<Event>();
}

void thread2()
{
    em.disconnect<Event>(someFunction);
    // now safe to cleanup (?)
    cleanupPtr();
}
In the above code, it might happen that Event is emitted, causing someFunction to be executed. If somePtr is non-null but becomes null just after the if, yet before the assignment, we're in trouble. From the point of view of thread2 this is not obvious, because it disconnects someFunction before calling cleanupPtr.
I can see why this could potentially lead to trouble, but whose responsibility is it? Should my library protect the user from using it in every irresponsible but imaginable way?
I suspect there is no clearly good answer, but clarity will come from documenting the guarantees you wish to make about concurrent access to an Emitter object.
One level of guarantee, which to me is what is implied by a promise of thread safety, is that:
Concurrent operations on the object are guaranteed to leave the object in a consistent state (at least, from the point of view of the accessing threads.)
Non-commutative operations will be performed as if they were scheduled serially in some (unknown) order.
Then the question is, what does the emit method promise semantically: passing control to the connected routine, or evaluation of the function? If the former, then your work sounds like it is already done; if the latter, then the 'as-if ordered' requirement would mean that you need to enforce some level of synchronisation.
Users of the library can work with either, provided it is clear what is being promised.
Firstly, the simplest possibility: if you don't claim your library to be thread-safe, you don't have to bother with this.
But even if you do:
In your example the user would have to take care of thread safety, since both functions could be dangerous even without using your event system (IMHO, this is a pretty good way to determine who should take care of this kind of problem). A possible way to do this in C++11 could be:
#include <mutex>

// A mutex is used to control thread access to a shared resource
std::mutex _somePtr_mutex;
int* somePtr = nullptr;

void someFunction()
{
    /*
        Create a 'lock_guard' to manage your mutex.
        Is the mutex '_somePtr_mutex' already locked?
        Yes: Wait until it's unlocked.
        No: Lock it and continue execution.
    */
    std::lock_guard<std::mutex> lock(_somePtr_mutex);
    if(somePtr)
        *somePtr = 17;
    // End of scope: 'lock' gets destroyed and hence unlocks '_somePtr_mutex'
}

void cleanupPtr()
{
    /*
        Create a 'lock_guard' to manage your mutex.
        Is the mutex '_somePtr_mutex' already locked?
        Yes: Wait until it's unlocked.
        No: Lock it and continue execution.
    */
    std::lock_guard<std::mutex> lock(_somePtr_mutex);
    int *tmp = somePtr;
    somePtr = nullptr;
    delete tmp;
    // End of scope: 'lock' gets destroyed and hence unlocks '_somePtr_mutex'
}
The last question is easy. If you say your library is thread-safe, it should be thread-safe. It makes no sense to say it is partly thread-safe, or that it is only thread-safe if you don't abuse it. In that case you have to explain exactly what is not thread-safe.
Now to your first question regarding someFunction:
The operation is not atomic, which means the CPU can interrupt between the if and the assignment. And that will happen, I know that :-) The other thread can erase the pointer at any time, even between two short and fast-looking statements.
Now to cleanupPtr:
I am not a compiler expert, but if you want to be sure that your assignment takes place at the point where you wrote it in the code, you should write the keyword volatile in front of the declaration of somePtr. The compiler will then know that you use that variable in a multithreaded situation and will not buffer the value in a CPU register.
If you have a thread situation with one reader thread and one writer thread, the keyword volatile can (IMHO) be enough to sync them, as long as the variables you use to exchange information between threads are simple built-in types.
For other situations you can use a mutex or atomics. I will give you an example using a mutex. I use C++11 for that, but it works similarly with previous versions of C++ using boost.
Using mutex:
int * somePtr = nullptr;
Emitter<Event> em; // just an object that can emit the 'Event' signal
std::recursive_mutex g_mutex;

void mainThread()
{
    em.connect<Event>(someFunction);
    // now, somehow, 2 threads are created which, at some point
    // execute the thread1() and thread2() functions below
}

void someFunction()
{
    std::lock_guard<std::recursive_mutex> lock(g_mutex);
    // can somePtr change after the check but before the set?
    if (somePtr)
        *somePtr = 17;
}

void cleanupPtr()
{
    std::lock_guard<std::recursive_mutex> lock(g_mutex);
    // this looks safe, but compilers and CPUs can reorder this code:
    int *tmp = somePtr;
    somePtr = nullptr;
    delete tmp;
}

void thread1()
{
    em.emit<Event>();
}

void thread2()
{
    em.disconnect<Event>(someFunction);
    // now safe to cleanup (?)
    cleanupPtr();
}
I only added a recursive mutex here without changing any other code of the sample, even if it's now cargo-cult code.
There are two kinds of mutex in the std: an utterly useless std::mutex, and std::recursive_mutex, which works the way you expect a mutex to work. std::mutex excludes any further access even from the same thread, which can deadlock if a method that needs mutex protection calls a public method that uses the same mutex. std::recursive_mutex is reentrant for the same thread.
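A small sketch of the re-entrancy problem described above (class and method names are made up for illustration):

#include <mutex>

class Registry {
    std::recursive_mutex m_;   // a plain std::mutex would deadlock below
public:
    void add() {
        std::lock_guard<std::recursive_mutex> lock(m_);
        // ... mutate shared state ...
    }
    void addTwice() {
        std::lock_guard<std::recursive_mutex> lock(m_);
        add();   // locks m_ again on the same thread: fine for
                 // recursive_mutex, undefined behavior for std::mutex
    }
};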
Atomics (or Interlocked operations in Win32) are another way, but only for exchanging values between threads or accessing them concurrently. Your example has no such values, but in your case I would take a closer look at them (std::atomic).
UPDATE
If you are the user of a library which is not explicitly declared as thread-safe by its developer, treat it as non-thread-safe and shield every call to it with a mutex lock.
To stick with the example: if you cannot change someFunction, then you have to wrap the function like:
void threadsafeSomeFunction()
{
    std::lock_guard<std::recursive_mutex> lock(g_mutex);
    someFunction();
}

Proper compiler intrinsics for double-checked locking?

When implementing double-checked locking, what is the proper way to do the memory and/or compiler barriers when implementing double-checked locking for initialization?
Something like std::call_once isn't what I want; it's way too slow. It's typically just implemented on top of pthread_mutex_lock or EnterCriticalSection, depending on the OS.
In my programs, I often run into initialization cases where the initialization is safe to repeat, as long as exactly one thread gets to set the final pointer. If another thread beats it to setting the final pointer to the singleton object, it deletes what it created and makes use of the other thread's. I also often use this in cases where it doesn't matter which thread "wins" because they all come up with the same result.
Here's an unsafe, overly-contrived example, using Visual C++ intrinsics:
MyClass *GetGlobalMyClass()
{
    static MyClass *const UNSET_POINTER = reinterpret_cast<MyClass *>(
        static_cast<intptr_t>(-1));
    static MyClass *volatile s_object = UNSET_POINTER;

    if (s_object == UNSET_POINTER)
    {
        MyClass *newObject = MyClass::Create();
        if (_InterlockedCompareExchangePointer(
                reinterpret_cast<void *volatile *>(&s_object),
                newObject, UNSET_POINTER) != UNSET_POINTER)
        {
            // Another thread beat us. If Create didn't return null, destroy.
            if (newObject)
            {
                newObject->Destroy(); // calls "delete this;", presumably
            }
        }
    }
    return s_object;
}
On a weakly-ordered memory architecture, my understanding is that it's possible that the new value of s_object is visible to other threads before other variables written inside MyClass::Create or MyClass::MyClass are visible. Also, the compiler itself could arrange the code this way in the absence of a compiler barrier (in Visual C++, _WriteBarrier, but _InterlockedCompareExchange acts as a barrier).
Do I need like a store fence intrinsic function in there or something in order to ensure that MyClass's variables are visible to all threads before s_object becomes somethings besides -1?
Fortunately, the rules in C++ are very simple:
If there is a data race, the behaviour is undefined.
In your code the data race is caused by the following read, which conflicts with the write operation in _InterlockedCompareExchangePointer:
if (s_object == UNSET_POINTER)
A thread-safe solution without blocking might look as follows. Note that on x86 a load operation with sequential consistency has basically no overhead compared to a regular load operation. If you care about other architectures, you can also use acquire/release instead of sequential consistency.
#include <atomic>

static std::atomic<MyClass*> s_object{nullptr};

MyClass* GetGlobalMyClass()
{
    MyClass* o = s_object.load(std::memory_order_seq_cst);
    if (o == nullptr) {
        o = new MyClass{...};
        MyClass* expected = nullptr;
        if (!s_object.compare_exchange_strong(expected, o, std::memory_order_seq_cst)) {
            delete o;        // another thread won the race; use its object
            o = expected;
        }
    }
    return o;
}
For a proper C++11 implementation, any function-local static variable will be constructed in a thread-safe fashion by the first thread that passes through its declaration.
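So, assuming the singleton's construction can simply happen on first use, the whole function collapses to a "magic static" (sketch):

MyClass& GetGlobalMyClass()
{
    static MyClass instance;   // construction is thread-safe in C++11:
                               // later threads block until it completes
    return instance;
}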

Is a race condition possible when only one thread writes to a bool variable in c++?

In the following code example, program execution never ends.
It creates a thread which waits for a global bool to be set to true before terminating. There is only one writer and one reader. I believe that the only situation that allows the loop to continue running is if the bool variable is false.
How is it possible that the bool variable ends up in an inconsistent state with just one writer?
#include <iostream>
#include <pthread.h>
#include <unistd.h>

bool done = false;

void * threadfunc1(void *) {
    std::cout << "t1:start" << std::endl;
    while(!done);
    std::cout << "t1:done" << std::endl;
    return NULL;
}

int main()
{
    pthread_t threads;
    pthread_create(&threads, NULL, threadfunc1, NULL);
    sleep(1);
    done = true;
    std::cout << "done set to true" << std::endl;
    pthread_exit(NULL);
    return 0;
}
There's a problem in the sense that this statement in threadfunc1():
while(!done);
can be implemented by the compiler as something like:
a_register = done;
label:
if (a_register == 0) goto label;
So updates to done will never be seen.
There is really nothing that prevents the compiler from optimizing the while loop away. Use an atomic or a mutex to access the bool from more than one thread; that is the only supported and correct solution. As you are using POSIX, a mutex would be the right solution in this case.
And don't use volatile. There is a POSIX standard that states what has to work, and volatile is not a solution that is guaranteed to work.
And there is another problem: there is no guarantee that your newly created thread ever started to run before you set the flag to true.
For such a simple example volatile is enough, but for the vast majority of real-world situations it is not. Use a condition variable for this task. They look weird at first glance but are actually quite logical. On x86, bool IS atomic to read/write (for ARM, probably not). There is also an obstacle with vector<bool>: it is NOT a vector of bools, it is a bitfield. To write to a vector from several threads, use vector<char> (or bool arr[SIZE]).
Also, you don't join the thread; that is wrong.
A race condition arises when two threads access the same object and at least one of the accesses is a write.
That means there are two types of race: a write-write conflict and a write-read conflict.
Back to your code, you essentially have two threads, one is the main thread, and another one is the one you created with pthread_create.
One of them is a read: while(!done), and one of them is a write: done = true.
You have race condition for sure.
Is a race condition possible when only one thread writes to a bool variable in c++?
Yes. In your case, the main thread is also a thread (i.e. you have one thread writing and one thread reading).
How is it possible that the bool variable ends up in an inconsistent state with just one writer?
The compiler is (should be) an optimizing compiler. It will probably optimize the reading of the done variable, unless you take care to avoid that (use std::atomic<bool> done instead).
It's not guaranteed that assignment to a bool, which is one byte, is atomic.
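A minimal rewrite of the example using std::atomic<bool> and std::thread (my own translation of the fix these answers suggest):

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<bool> done{false};

void threadfunc1() {
    std::cout << "t1:start" << std::endl;
    while (!done.load(std::memory_order_acquire)) {} // compiler must really re-load
    std::cout << "t1:done" << std::endl;
}

int main() {
    std::thread t(threadfunc1);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    done.store(true, std::memory_order_release);
    std::cout << "done set to true" << std::endl;
    t.join(); // also addresses the missing join noted above
}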

Do I need to use volatile keyword if I declare a variable between mutexes and return it?

Let's say I have the following function.
std::mutex mutex;

int getNumber()
{
    mutex.lock();
    int size = someVector.size();
    mutex.unlock();
    return size;
}
Is this a place to use the volatile keyword when declaring size? Will return value optimization or something else break this code if I don't use volatile? The size of someVector can be changed from any of the numerous threads the program has, and it is assumed that only one thread (other than the modifiers) calls getNumber().
No. But beware that the size may not reflect the actual size AFTER the mutex is released.
Edit: If you need to do some work that relies on the size being correct, you will need to wrap that whole task with a mutex.
You haven't mentioned what the type of the mutex variable is, but assuming it is an std::mutex (or something similar meant to guarantee mutual exclusion), the compiler is prevented from performing a lot of optimizations. So you don't need to worry about return value optimization or some other optimization allowing the size() query from being performed outside of the mutex block.
However, as soon as the mutex lock is released, another waiting thread is free to access the vector and possibly mutate it, thus changing the size. Now, the number returned by your function is outdated. As Mats Petersson mentions in his answer, if this is an issue, then the mutex lock needs to be acquired by the caller of getNumber(), and held until the caller is done using the result. This will ensure that the vector's size does not change during the operation.
Explicitly calling mutex::lock followed by mutex::unlock quickly becomes unfeasible for more complicated functions involving exceptions, multiple return statements etc. A much easier alternative is to use std::lock_guard to acquire the mutex lock.
int getNumber()
{
    std::lock_guard<std::mutex> l(mutex); // lock is acquired
    int size = someVector.size();
    return size;
} // lock is released automatically when l goes out of scope
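And if the caller must act on an up-to-date size, the lock has to be held across the whole operation, as the answers above note. A hypothetical sketch:

void removeLastIfNonEmpty()
{
    std::lock_guard<std::mutex> l(mutex);
    if (!someVector.empty())   // checked and used under the same lock,
        someVector.pop_back(); // so the size cannot change in between
}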
volatile is a keyword that tells the compiler to literally perform every read and write of the variable and not to optimize any of them away. Here is an example:
int example_function() {
    int a;
    volatile int b;
    a = 1; // this is ignored because nothing reads it before it is assigned again
    a = 2; // same here
    a = 3; // this is the last one, so a write takes place
    b = 1; // b gets written here, because b is volatile
    b = 2; // and again
    b = 3; // and again
    return a + b;
}
What is the real use of this? I've seen it in delay functions (keeping the CPU busy for a bit by making it count up to a number) and in systems where several threads might look at the same variable. It can sometimes help a bit with multi-threaded code, but it isn't really a threading tool and is certainly not a silver bullet.