I have two threads that share a common variable.
The code structure is basically this (very simplified pseudo code):
static volatile bool commondata;

void Thread1()
{
    ...
    commondata = true;
    ...
}

void Thread2()
{
    ...
    while (!commondata)
    {
        ...
    }
    ...
}
Both threads run and at some point Thread1 sets commondata to true. The while loop in Thread2 should then stop. The important thing here is that Thread2 "sees" the change made to commondata by Thread1.
I know that the naive method using a volatile variable is not correct and is not guaranteed to work on every platform.
Is it good enough to replace volatile bool commondata with std::atomic<bool> commondata?
Simple answer: yes! :)
All operations on atomics are data race free and by default sequentially consistent.
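Applied to the original example, the fix is essentially a type change. A minimal sketch; the elided parts stand in for the question's placeholder work:

#include <atomic>

static std::atomic<bool> commondata{false};

void Thread1()
{
    // ...
    commondata.store(true);    // sequentially consistent by default
    // ...
}

void Thread2()
{
    // ...
    while (!commondata.load()) // sequentially consistent by default
    {
        // ...
    }
    // ...
}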
There is one caveat here. While the answer is generally 'yes', std::atomic does not, by itself, make the variable volatile. That means that if the compiler can (by some unfathomable means) infer that the variable did not change, it is allowed not to re-read it, since it may assume the read has no side effects.
If you check, there are overloads for both volatile and non-volatile versions of the class: http://eel.is/c++draft/atomics.types.generic
That could become important if the atomic variable lives in shared memory, for example.
Related
I have a global reference-counted object obj that I want to protect from data races by using atomic operations:
T* obj; // initially nullptr
std::atomic<int> count; // initially zero
My understanding is that I need to use std::memory_order_release after I write to obj, so that the other threads will be aware of it being created:
void increment()
{
    if (count.load(std::memory_order_relaxed) == 0)
        obj = std::make_unique<T>();
    count.fetch_add(1, std::memory_order_release);
}
Likewise, I need to use std::memory_order_acquire when reading the counter, to ensure the thread has visibility of obj being changed:
void decrement()
{
    count.fetch_sub(1, std::memory_order_relaxed);
    if (count.load(std::memory_order_acquire) == 0)
        obj.reset();
}
I am not convinced that the code above is correct, but I'm not entirely sure why. I feel like after obj.reset() is called, there should be a std::memory_order_release operation to inform other threads about it. Is that correct?
Are there other things that can go wrong, or is my understanding of atomic operations in this case completely wrong?
It is wrong regardless of memory ordering.
As @MaartenBamelis pointed out, for concurrent calls to increment the object is constructed twice. And the same is true for concurrent decrement: the object is reset twice (which may result in a double destructor call).
Note that there's a disagreement between the T* obj; declaration and its use as a unique_ptr, but neither a raw pointer nor a unique pointer is safe for concurrent modification. In practice, reset or delete will check the pointer for null, then delete it and set it to null, and these steps are not atomic.
fetch_add and fetch_sub are fetch-and-op instead of just op for a reason: if you don't use the value observed during the operation, it is likely to be a race.
This code is inherently racy. If two threads call increment at the same time when count is initially 0, both will see count as 0, and both will create obj (and race to see which copy is kept; given unique_ptr has no special threading protections, terrible things can happen if two of them set it at once).
If two threads decrement at the same time (holding the last two references), and finish the fetch_sub before either calls load, both will reset obj (also bad).
And if a decrement finishes the fetch_sub (to 0), then another thread increments before the decrement load occurs, the increment will see the count as 0 and reinitialize. Whether the object is cleared after being replaced, or replaced after being cleared, or some horrible mixture of the two, will depend on whether increment's fetch_add runs before or after decrement's load.
In short: If you find yourself using two separate atomic operations on the same variable, and testing the result of one of them (without looping, as in a compare and swap loop), you're wrong.
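For reference, the compare-and-swap loop that the rule alludes to looks roughly like this. This is a generic sketch of the idiom applied to the question's count, not a complete fix for this reference count; increment_cas is a hypothetical name:

void increment_cas()
{
    int expected = count.load(std::memory_order_relaxed);
    do {
        // decide what to do based on the value actually observed;
        // compare_exchange_weak reloads expected on failure, so each
        // iteration re-decides using the current value
    } while (!count.compare_exchange_weak(expected, expected + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed));
}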
More correct code would look like:
void increment() // Still not safe
{
    // acquire is good for the != 0 case, for a later read of obj,
    // or would be if the other writer did a release *after* constructing an obj
    if (count.fetch_add(1, std::memory_order_acquire) == 0)
        obj = std::make_unique<T>();
}

void decrement()
{
    if (count.fetch_sub(1, std::memory_order_acquire) == 1)
        obj.reset();
}
but even then it's not reliable; there's no guarantee that, when count is 0, two threads couldn't call increment, both of them fetch_add at once, and while exactly one of them is guaranteed to see the count as 0, said 0-seeing thread might end up delayed while the one that saw it as 1 assumes the object exists and uses it before it's initialized.
I'm not going to swear there's no mutex-free solution here, but dealing with the issues involved with atomics is almost certainly not worth the headache.
It might be possible to confine the mutex to inside the if() branches, but taking a mutex is also an atomic RMW operation (and not much more than that for a good lightweight implementation) so this doesn't necessarily help a huge amount. If you need really good read-side scaling, you'd want to look into something like RCU instead of a ref-count, to allow readers to truly be read-only, not contending with other readers.
I don't really see a simple way of implementing a reference-counted resource with atomics. Maybe there's some clever way that I haven't thought of yet, but in my experience, clever does not equal readable.
My advice would be to implement it first using a mutex. Then you simply lock the mutex, check the reference count, do whatever needs to be done, and unlock again. It's guaranteed correct:
std::mutex mutex;
int count;
std::unique_ptr<T> obj;

void increment()
{
    auto lock = std::scoped_lock{mutex};
    if (++count == 1) // Am I the first reference?
        obj = std::make_unique<T>();
}

void decrement()
{
    auto lock = std::scoped_lock{mutex};
    if (--count == 0) // Was I the last reference?
        obj.reset();
}
Although at this point, I would just use a std::shared_ptr instead of managing the reference count myself:
std::mutex mutex;
std::weak_ptr<T> obj;

std::shared_ptr<T> acquire()
{
    auto lock = std::scoped_lock{mutex};
    auto sp = obj.lock();
    if (!sp)
        obj = sp = std::make_shared<T>();
    return sp;
}
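For illustration, a caller might use it like this (a usage sketch under the declarations above; worker is a hypothetical function):

void worker()
{
    auto sp = acquire(); // constructs the T on first use, reuses it otherwise
    // ... use *sp ...
}                        // the last shared_ptr to be destroyed deletes the T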
I believe this also makes it safe for the case where the object's constructor may throw.
Mutexes are surprisingly performant, so I expect that locking code is plenty quick unless you have a highly specialized use case where you need code to be lock-free.
I recently came across some code that was working fine where a static bool was shared between multiple threads (single writer, multiple receivers) although there was no synchronization.
Something like this (simplified):
// header A
struct A {
    static bool f;
    static bool isF() { return f; }
};

// source A
bool A::f = false;

void threadWriter() {
    /* Do something */
    A::f = true;
}

// source B
void threadReader() {
    while (!A::isF()) { /* Do something */ }
}
For me, this kind of code has a race condition in that even though operations on bool are atomic (on most CPUs), we have no guarantee that the write from the writer thread will be visible to the reader threads. But some people told me that the fact that f is static would help.
So, is there anything in C++11 that would make this code safe? Or anything related to static that would make this code work?
Your hardware may be able to atomically operate on a bool. However, that does not make this code safe. As far as C++ is concerned, you are writing and reading the bool in different threads without synchronisation, which is undefined.
Making the bool static does not change that.
To access the bool in a thread-safe way you can use a std::atomic<bool>. Whether the atomic uses a mutex or other locking is up to the implementation.
Note, though, that even a std::atomic<bool> is not sufficient to synchronize threadReader() and threadWriter() if each /* Do something */ accesses the same shared data.
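If the /* Do something */ parts do touch shared data, the flag can still publish it, provided the writer stores with release semantics (or the seq_cst default) and the reader loads with acquire. A minimal sketch, where payload stands in for hypothetical shared data:

#include <atomic>

static int payload = 0;            // hypothetical non-atomic shared data
static std::atomic<bool> f{false};

void threadWriter() {
    payload = 42;                  // written before the flag is set
    f.store(true, std::memory_order_release);
}

void threadReader() {
    while (!f.load(std::memory_order_acquire)) { /* spin */ }
    // the acquire load that saw true synchronizes-with the release store,
    // so this read of payload is data-race free
    int v = payload;
    (void)v;
}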
But some people told me that the fact that f is static would help.
Frankly, this sounds like cargo cult. I can imagine that it was confused with the fact that initialization of static local variables is thread-safe. From cppreference:
If multiple threads attempt to initialize the same static local variable concurrently, the initialization occurs exactly once (similar behavior can be obtained for arbitrary functions with std::call_once). Note: usual implementations of this feature use variants of the double-checked locking pattern, which reduces runtime overhead for already-initialized local statics to a single non-atomic boolean comparison.
Look at the Meyers singleton for an example of that. Though, this is merely about initialization. For example, here:
int& foo() {
    static int x = 42;
    return x;
}
Two threads can call this function concurrently and x will be initialized exactly once. This has no impact on the thread-safety of x itself. If two threads call foo and one writes while another reads x, there is a data race. However, this is about initialization of static local variables and has nothing to do with your example. I don't know what they meant when they told you static would "help".
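If x itself is meant to be read and written concurrently, it would have to be atomic as well. A minimal sketch:

#include <atomic>

std::atomic<int>& foo() {
    static std::atomic<int> x{42}; // initialization still happens exactly once
    return x;                      // concurrent loads/stores on x are now data-race free
}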
Reading this article about the Double-Checked Locking Pattern in C++, I reached the place (page 10) where the authors demonstrate one of the attempts to implement DCLP "correctly" using volatile variables:
class Singleton {
public:
    static volatile Singleton* volatile instance();
private:
    static volatile Singleton* volatile pInstance;
};

// from the implementation file
volatile Singleton* volatile Singleton::pInstance = 0;

volatile Singleton* volatile Singleton::instance() {
    if (pInstance == 0) {
        Lock lock;
        if (pInstance == 0) {
            volatile Singleton* volatile temp = new Singleton;
            pInstance = temp;
        }
    }
    return pInstance;
}
After such example there is a text snippet that I don't understand:
First, the Standard's constraints on observable behavior are only for an abstract machine defined by the Standard, and that abstract machine has no notion of multiple threads of execution. As a result, though the Standard prevents compilers from reordering reads and writes to volatile data within a thread, it imposes no constraints at all on such reorderings across threads. At least that's how most compiler implementers interpret things. As a result, in practice, many compilers may generate thread-unsafe code from the source above.
and later:
... C++'s abstract machine is single-threaded, and C++ compilers may choose to generate thread-unsafe code from source like the above, anyway.
These remarks are about execution on a uniprocessor, so it's definitely not about cache-coherence issues.
If the compiler can't reorder reads and writes to volatile data within a thread, how can it reorder reads and writes across threads for this particular example thus generating thread-unsafe code?
The pointer to the Singleton may be volatile, but the data within the singleton is not.
Imagine Singleton has int x, y, z; as members, set to 15, 16, 17 in the constructor for some reason.
volatile Singleton* volatile temp = new Singleton;
pInstance = temp;
OK, temp is written before pInstance. When are x,y,z written relative to those? Before? After? You don't know. They aren't volatile, so they don't need to be ordered relative to the volatile ordering.
Now a thread comes in and sees:
if (pInstance == 0) { // first check
And let's say pInstance has been set and is not null.
What are the values of x,y,z? Even though new Singleton has been called, and the constructor has "run", you don't know whether the operations that set x,y,z have run or not.
So now your code goes and reads x,y,z and crashes, because it was really expecting 15,16,17, not random data.
Oh wait, pInstance is a volatile pointer to volatile data! So x,y,z are volatile, right? Right? And thus ordered with pInstance and temp. Aha!
Almost. Any reads from *pInstance will be volatile, but the construction via new Singleton was not volatile. So the initial writes to x,y,z were not ordered. :-(
So you could, maybe, make the members volatile int x, y, z; OK. However...
C++ now has a memory model, even if it didn't when the article was written. Under the current rules, volatile does not prevent data races. volatile has nothing to do with threads. The program is UB. Cats and Dogs living together.
Also, although this is pushing the limits of the standard (i.e. it gets vague as to what volatile really means), an all-knowing, all-seeing, full-program-optimizing compiler could look at your uses of volatile and say "no, those volatiles don't actually connect to any I/O memory addresses etc., they really aren't observable behaviour, I'm just going to make them non-volatile"...
I think they're referring to the cache coherency problem discussed in section 6 ("DCLP on Multiprocessor Machines"). With a multiprocessor system, the processor/cache hardware may write out the value for pInstance before the values are written out for the allocated Singleton. This can cause a second CPU to see the non-NULL pInstance before it can see the data it points to.
This requires a hardware fence instruction to ensure all the memory is updated before other CPUs in the system can see any of it.
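In C++11 terms, that hardware fence corresponds to release/acquire ordering or explicit fences. A sketch, assuming pInstance were changed to a std::atomic<Singleton*> (a hypothetical variant of the article's code; publish and consume are illustrative names):

#include <atomic>

std::atomic<Singleton*> pInstance{nullptr};

void publish() {
    Singleton* temp = new Singleton;                   // construct fully first
    std::atomic_thread_fence(std::memory_order_release);
    pInstance.store(temp, std::memory_order_relaxed); // publish after the fence
}

Singleton* consume() {
    Singleton* p = pInstance.load(std::memory_order_relaxed);
    if (p)
        std::atomic_thread_fence(std::memory_order_acquire); // order later reads of *p
    return p;
}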
If I'm understanding correctly, they are saying that in the context of a single-threaded abstract machine the compiler may simply transform:
volatile Singleton* volatile temp = new Singleton;
pInstance = temp;
Into:
pInstance = new Singleton;
Because the observable behavior is unchanged. This brings us back to the original problem with double-checked locking.
I have read many questions about thread-safe double-checked locking (for singletons or lazy init). In some threads, the answer is that the pattern is entirely broken; others suggest a solution.
So my question is: Is there a way to write a fully thread-safe double-checked locking pattern in C++? If so, what does it look like?
We can assume C++11, if that makes things easier. As far as I know, C++11 improved the memory model, which could yield the needed improvements.
I do know that it is possible in Java by making the double-check guarded variable volatile. C++11 borrowed large parts of its memory model from Java's, so I think it could be possible, but how?
Simply use a static local variable for lazily initialized Singletons, like so:
MySingleton* GetInstance() {
    static MySingleton instance;
    return &instance;
}
The (C++11) standard already guarantees that static variables are initialized in a thread-safe manner, and it seems likely that the implementation of this is at least as robust and performant as anything you'd write yourself.
The thread-safety of the initialization can be found in §6.7.4 of the (C++11) standard:
If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization.
Since you wanted to see a valid DCLP C++11 implementation, here is one.
The behavior is fully thread-safe and identical to GetInstance() in Grizzly's answer.
std::mutex mtx;
std::atomic<MySingleton *> instance_p{nullptr};

MySingleton* GetInstance()
{
    auto *p = instance_p.load(std::memory_order_acquire);
    if (!p)
    {
        std::lock_guard<std::mutex> lck{mtx};
        p = instance_p.load(std::memory_order_relaxed);
        if (!p)
        {
            p = new MySingleton;
            instance_p.store(p, std::memory_order_release);
        }
    }
    return p;
}
Is the following singleton implementation data-race free?
static std::atomic<Tp *> m_instance;
...

static Tp &
instance()
{
    if (!m_instance.load(std::memory_order_relaxed))
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (!m_instance.load(std::memory_order_acquire))
        {
            Tp * i = new Tp;
            m_instance.store(i, std::memory_order_release);
        }
    }
    return *m_instance.load(std::memory_order_relaxed);
}
Is the std::memory_order_acquire of the load operation superfluous? Is it possible to further relax both load and store operations by switching them to std::memory_order_relaxed? In that case, is the acquire/release semantic of std::mutex enough to guarantee its correctness, or is a further std::atomic_thread_fence(std::memory_order_release) also required to ensure that the writes to memory of the constructor happen before the relaxed store? And is the use of the fence equivalent to having the store with memory_order_release?
EDIT: Thanks to John's answer, I came up with the following implementation that should be data-race free. Even though the inner load could be non-atomic altogether, I decided to leave a relaxed load in, since it does not affect performance. In comparison to always performing the outer load with acquire memory order, the thread_local machinery improves the performance of accessing the instance by about an order of magnitude.
static Tp &
instance()
{
    static thread_local Tp *instance;
    if (!instance &&
        !(instance = m_instance.load(std::memory_order_acquire)))
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (!(instance = m_instance.load(std::memory_order_relaxed)))
        {
            instance = new Tp;
            m_instance.store(instance, std::memory_order_release);
        }
    }
    return *instance;
}
I think this is a great question, and John Calsbeek has the correct answer.
However, just to be clear: a lazy singleton is best implemented using the classic Meyers singleton. It has guaranteed correct semantics in C++11.
§6.7.4
... If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization. ...
The Meyers singleton is preferred in that the compiler can aggressively optimize the concurrent code. The compiler would be more restricted if it had to preserve the semantics of a std::mutex. Furthermore, the Meyers singleton is two lines and virtually impossible to get wrong.
Here is a classic example of a Meyers singleton. Simple, elegant, and broken in C++03. But simple, elegant, and powerful in C++11.
class Foo
{
public:
    static Foo& instance( void )
    {
        static Foo s_instance;
        return s_instance;
    }
};
That implementation is not race-free. The atomic store of the singleton, while it uses release semantics, will only synchronize with the matching acquire operation—that is, the load operation that is already guarded by the mutex.
It's possible that the outer relaxed load would read a non-null pointer before the locking thread finished initializing the singleton.
The acquire that is guarded by the lock, on the other hand, is redundant. It will synchronize with any store with release semantics on another thread, but at that point (thanks to the mutex) the only thread that can possibly store is the current thread. That load doesn't even need to be atomic—no stores can happen from another thread.
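Putting both fixes together, a corrected sketch of that function uses acquire on the outer load and can keep the inner one relaxed (the inner load does not even need to be atomic, per the reasoning above):

static Tp &
instance()
{
    Tp *p = m_instance.load(std::memory_order_acquire); // pairs with the release store
    if (!p)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        p = m_instance.load(std::memory_order_relaxed); // the mutex already orders this load
        if (!p)
        {
            p = new Tp;
            m_instance.store(p, std::memory_order_release);
        }
    }
    return *p;
}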
See Anthony Williams' series on C++0x multithreading.
See also call_once.
Where you'd previously use a singleton to do something, but not actually use the returned object for anything, call_once may be the better solution.
For a regular singleton you could do call_once to set a (global?) variable and then return that variable...
Simplified for brevity:
template< class Function, class... Args>
void call_once( std::once_flag& flag, Function&& f, Args&&... args );
Exactly one execution of exactly one of the functions, passed as f to the invocations in the group (same flag object), is performed.
No invocation in the group returns before the abovementioned execution of the selected function is completed successfully
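A minimal sketch of the call_once-based singleton described above (reusing the MySingleton name from the earlier answers):

#include <mutex>

MySingleton* GetInstance()
{
    static std::once_flag flag;
    static MySingleton* instance = nullptr;
    std::call_once(flag, [] { instance = new MySingleton; });
    // no caller returns before the initialization above has completed
    return instance;
}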