double checked locking pattern in c++ concurrent programming

double checked locking pattern in c++ concurrent programming - c++

I am reading concurrency programming in c++ and came across this piece of code. the book mentioned the potential for nasty race conditions.
void undefined_behaviour_with_double_checked_locking(){
if(!resource_ptr){ //<1>
std::lock_guard<std::mutex> lk(resource_mutex);
if(!resource_ptr){ //<2>
resource_ptr.reset(new some_resource); //<3>
}
}
resource_ptr->do_something(); //<4>
}
here is the quote of explanation from the book. however, i just cant come up with a real example. I wonder if anyone here could help me out.
Unfortunately, this pattern is infamous for a reason: it has the
potential for nasty race conditions, because the read outside the lock
<1> isn’t synchronized with the write done by another thread inside
the lock <3>. This therefore creates a race condition that covers not
just the pointer itself but also the object pointed to; even if a
thread sees the pointer written by another thread, it might not see
the newly created instance of some_resource, resulting in the call to
do_something() <4> operating on incorrect values.

You don't show what resource_ptr is but from the explanation the reasoning seems to be that "!resource_ptr" (outside the lock) and "resource_ptr.reset" (inside the lock) are not atmoic and are not synchronized with each other.
The use case would be:
thread1 comes into the method, sees that resource_ptr is not
populated, enters the lock and is in the middle of the
resource_ptr.reset.
thread2 comes into the method and is when
checking !resource_ptr may see it as set but resource_ptr may not be
fully configured for use.
thread2 falls through to execute "resource_ptr->do_something()" and may see resource_ptr in an inconsistent state and bad things may happen.

I recommend you read this: http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf.
Anyway, the gist of it is: the compiler is free to reorder operations as long as they appear to be executed in the program's order in a single threaded situation. On top of that, some CPU architectures take the same liberties with their instruction execution order. So, technically resource_ptr could be modified to point to newly allocated memory before some_resource's constructor has finished. Another thread could at that time see that resource_ptr is not null and attempt to use the not-yet-fully-constructed instance.
The use of a smart pointer instead of a raw pointer might make this less likely, but it doesn't rule it out afaik.

The potential problem is that the write to resource_ptr isn't atomic (inside the reset call). Assuming that resource_ptr is a global or static variable that (/ or otherwise) starts initialized with the value NULL before we get here, it will never cause a thread to fall-through unless the object some_resource is already fully allocated and constructed, however - say that the pointer to this new object is 0x123456789, then it is theoretically possible that resource_ptr has, for example, the value 0x12340000 when another thread does the if (!resource_ptr) test, falls through and uses that value (especially more likely when using aliasing). If resource_ptr is an atomic variable then this code would be fine.
If a program can guarantee that the first time this code is called there is only one thread running (ie, the first call will be from main() before any other thread is created) then this will work fine too, because once initialized, the if test will just always fall through, resulting in only read accesses to resource_ptr while more than one thread is running. In that case you don't need the lock inside the if block though, and you are not allowed to ever write to resource_ptr anywhere else.

Related

A legitimate use case for volatile in C++?

I'm running a few threads that basically are all returning the same object as a result. Then I wait for all of them to complete, and basically read the results. To avoid needing synchronization, I figured I could just pre-allocate all the result objects in an array or vector and give the threads a pointer to each. At a high level, the code is something like this (simplified):
std::vector<Foo> results(2);
RunThread1(&results[0]);
RunThread2(&results[1]);
WaitForAll();
// Read results
cout << results[0].name << results[1].name;
Basically I'd like to know if there's anything potentially unsafe about this "code". One thing I was wondering is whether the vector should declared volatile such that the reads at the end aren't optimized and output an incorrect value.

The short answer to your question is no, the array should not be declared volatile. For two simple reasons:
Using volatile is not necessary. Every sane multithreading platform provides synchronization primitives with well-defined semantics. If you use them, you don't need volatile.
Using volatile isn't sufficient. Since volatile doesn't have defined multithread semantics on any platform you are likely to use, it alone is not enough to provide synchronization.
Most likely, whatever you do in WaitForAll will be sufficient. For example, if it uses an event, mutex, condition variable, or almost anything like that, it will have defined multithreading semantics that are sufficient to make this safe.
Update: "Just out curiosity, what would be an example of something that happens in the WaitForAll that guarantees safety of the read? Wouldn't it need to effectively tell the compiler somehow to "flush" the cache or avoid optimizations of subsequent read operations?"
Well, if you're using pthreads, then if it uses pthread_join that would be an example that guarantees safety of the read because the documentation says that anything the thread does is visible to a thread that joins it after pthread_join returns.
How it does it is an implementation detail. In practice, on modern systems, there are no caches to flush nor are there any optimizations of subsequent reads that are possible but need to be avoided.
Consider if somewhere deep inside WaitForAll, there's a call to pthread_join. Generally, you simply don't let the compiler see into the internals of pthread_join and thus the compiler has to assume that pthread_join might do anything that another thread could do. So keeping information that another thread might modify in a register across a call to pthread_join would be illegal because pthread_join itself might access or modify that data.

I was wondering is whether the vector should declared volatile such that the reads at the end aren't optimized and output an incorrect value.
No. If there was problem with lack of synchronisation, then volatile would not help.
But there is no problem with lack of synchronisation, since according to your description, you don't access the same object from multiple threads - until you've waited for the threads to complete, which is something that synchronises the threads.
There is a potential problem that if the objects are small (less than about 64 bytes; depending on CPU architecture), then the objects in the array share a cache "line", and access to them may become effectively synchronised due to write contention. This is only a problem if the threads write to the variable a lot in relation to operations that don't access the output object.

It depends on what's in WaitForAll(). If it's a proper synchronization, all is good. Mutexes, for example, or thread join, will result in the proper memory synchronization being put in.
Volatile would not help. It may prevent compiler optimizations, but would not affect anything happening at the CPU level, like caches not being updated. Use proper synchronization, like mutexes, thread join, and then the result will be valid (sequentially coherent). Don't count on silver bullet volatile. Compilers and CPUs are now complex enough that it won't be guaranteed.
Other answers will elaborate on the memory fences and other instructions that the synchronization will put in. :-)

How to make an object in c++ volatile?

`struct MyClass {
~MyClass() {
// Asynchronously invoke deletion (erase) of entries from my_map;
// Different entries are deleted in different threads.
// Need to spin as 'this' object is shared among threads and
// destruction of the object will result in seg faults.
while(my_map.size() > 0); // This spins for ever due to complier optimization.
}
unordered_map<key, value> my_map;
};`
I have the above class in which elements of the unordered map are deleted asynchronoulsy in the destructor and I must spin/sleep as the object is shared among other threads. I cannot declare my_map as volatile as it results in compilation errors. What else can I do here ? How do I tell the complier that my_map.size() will result in 0 at some point in time. Please do not tell me why/how this design is bad; I cannot change the design as it is bound due to the reason I cannot explain unless I write thousands of lines of code here.
Edit: my_map is protected using a version of spinlock. So, threads do grab the spinlock before erasing the entries. Just the while(my_map.size() > 0); was the only "naive" spin I had in the code. I converted it to grab the spinlock and then check the size (in a loop) and it worked. Though using a condition_variable would be the right way of doing it, we use asynchronous programming model (like SEDA) which binds us to not use any sleeping/yeilding calls.

volatile is not the solution to this problem. volatile has exactly three uses: 1. Accessing memory mapped devices in a driver, 2. signal handlers, 3. setjmp usage.
Read the following, over and over until it sinks in. volatile is useless in multithreading.
A naive spin lock like that has three problems:
The compiler is permitted to cache results, therefore you see the "spin forever" behavior you're seeing.
In the classic case, you have the risk of a race condition: thread A may check the lock variable, find the resource is accessible, but then get pre-empted before setting the lock variable. Along comes thread B who also finds the lock variable showing the resource as accessible, so it then locks it and starts to access the resource, Then thread A wakes back up, locks the variable again, and also accesses the resource.
There is a data write order problem. If a protected variable is written to, and then a lock variable is changed, you have no guarantees that a different thread will not see the protected variable changed even though it may also see the lock variable claiming it has been written. Both the compiler and the Out of order execution on the CPU are permitted do this.
volatile only solves the first of these problems, it does nothing to address the other two. With one caveat, by default MSVC on x86 / x64 adds a memory fence to volatile accesses, even though it's not required by the standard. That happens to solve the third problem, but it still doesn't fix the second one.
The only solutions to all three of these problems involves use of correct synchronization primitives: std::atomic<> if you really must spin lock, preferably std::mutex and maybe std::condition_variable for a lock that will put the thread to sleep till something interesting happens.

std::promise set_value and thread safety

I'm a bit confused about the requirements in terms of thread-safety placed on std::promise::set_value().
The standard says:
Effects: Atomically stores the value r in the shared state and makes
that state ready
However, it also says that promise::set_value() can only be used to set a value once. If it is called multiple times, a std::future_error is thrown. So you can only set the value of a promise once.
And indeed, just about every tutorial, online code sample, or actual use case for std::promise involves a communication channel between 2 threads, where one thread calls std::future::get(), and the other thread calls std::promise::set_value().
I've never seen a use case where multiple threads might call std::promise::set_value(), and even if they did, all but one would cause a std::future_error exception to be thrown.
So why does the standard mandate that calls to std::promise::set_value() are atomic? What is the use case for calling std::promise::set_value() from multiple threads concurrently?
EDIT:
Since the top-voted answer here is not really answering my question, I assume what I'm asking is unclear. So, to clarify: I'm aware of what futures and promises are for and how they work. My question is why, specifically, does the standard insist that std::promise::set_value() must be atomic? This is a more subtle question than "why must there not be a race between calls to promise::set_value() and calls to future::get()"?
In fact, many of the answers here (incorrectly) respond that the reason is because if std::promise::set_value() wasn't atomic, then std::future::get() could potentially cause a race condition. But this is not true.
The only requirement to avoid a race condition is that std::promise::set_value() must have a happens-before relationship with std::future::get() - in other words, it must be guaranteed that when std::future::wait() returns, std::promise::set_value() has completed.
This is completely orthogonal to std::promise::set_value() itself being atomic or not. In a typical implementation using condition variables, std::future::get()/wait() would wait on a condition variable. Then, std::promise::set_value() could non-atomically perform any arbitrarily complex computation to set the actual value. Then it would notify the shared condition variable, (implying a memory fence with release semantics), and std::future::get() would wake up and safely read the result.
So, std::promise::set_value() itself does not need to be atomic to avoid a race condition here - it simply needs to satisfy a happens-before relationship with std::future::get().
So again, my question is: why does the C++ standard insist that std::promise::set_value() must actually be an atomic operation, as if a call to std::promise::set_value() was performed entirely under a mutex lock? I see no reason why this requirement should exist, unless there is some reason or use case for multiple threads calling std::promise::set_value() concurrently. And I can't think of such a use-case, hence this question.

If it was not an atomic store, then two threads could simultaneously call promise::set_value, which does the following:
check that the future is not ready (i.e., has a stored value or exception)
store the value
mark the state ready
release anything blocking on the shared state becoming ready
By making this sequence atomic, the first thread to execute (1) gets all the way through to (3), and any other thread calling promise::set_value at the same time will fail at (1) and raise a future_error with promise_already_satisfied.
Without the atomicity, two threads could potentially store their value, and then one would successfully mark the state ready, and the other would raise an exception, i.e. the same result, except that it might be the value from the thread that saw an exception that got through.
In many cases that might not matter which thread 'wins', but when it does matter, without the atomicity guarantee you would need to wrap another mutex around the promise::set_value call. Other approaches such as compare-and-exchange wouldn't work because you can't check the future (unless it's a shared_future) to see if your value won or not.
When it doesn't matter which thread 'wins', you could give each thread its own future, and use std::experimental::when_any to collect the first result that happened to become available.
Edit after some historical research:
Although the above (two threads using the same promise object) doesn't seem like a good use-case, it was certainly envisaged by one of the contemporary papers of the introduction of future to C++: N2744. This paper proposed a couple of use-cases which had such conflicting threads calling set_value, and I'll quote them here:
Second, consider use cases where two or more asynchronous operations are performed in parallel and "compete" to satisfy the promise. Some examples include:
A sequence of network operations (e.g. request a web page) is performed in conjunction with a wait on a timer.
A value may be retrieved from multiple servers. For redundancy, all servers are tried but only the first value obtained is needed.
In both examples, the first asynchronous operation to complete is the one that satisfies the promise. Since either operation may complete second, the code for both must be written to expect that calls to set_value() may fail.

I've never seen a use case where multiple threads might call
std::promise::set_value(), and even if they did, all but one would
cause a std::future_error exception to be thrown.
You missed the whole idea of promises and futures.
Usually, we have a pair of promise and a future. the promise is the object you push the asynchronous result or the exception, and the future is the object you pull the asynchronous result or the exception.
Under most cases, the future and the promise pair do not reside on the same thread, (otherwise we would use a simple pointer). so, you might pass the promise to some thread, threadpool, or some third library asynchronous function, and set the result from there, and pull the result in the caller thread.
setting the result with std::promise::set_value must be atomic, not because many promises set the result, but because an object (the future) which resides on another thread must read the result, and doing it un-atomically is undefined behavior, so setting the value and pulling it (either by calling std::future::get or std::future::then) must happen atomically
Remember, every future and promise has a shared state, setting the result from one thread updates the shared state, and getting the result reads from the shared state. like every shared state/memory in C++, when it's done from multiple threads, the update/reading must happen under a lock. otherwise it's undefined behavior.

These are all good answers, but there's one additional point that's essential. Without atomicity of setting a value, reading the value may be subject to observability side-effects.
E.g., in a naive implementation:
void thread1()
{
// do something. Maybe read from disk, or perform computation to populate value
v = value;
flag = true;
}
void thread2()
{
if(flag)
{
v2 = v;//Here we have a read problem.
}
}
Atomicity in the std::promise<> allows you to avoid the very basic race condition between writing a value in one thread and reading in another. Of course, if flag were std::atomic<> and the proper fence flags are used, you no longer have any side effects, and std::promise guarantees that.

C++ constructor memory synchronization

Assume that I have code like:
void InitializeComplexClass(ComplexClass* c);
class Foo {
public:
Foo() {
i = 0;
InitializeComplexClass(&c);
}
private:
ComplexClass c;
int i;
};
If I now do something like Foo f; and hand a pointer to f over to another thread, what guarantees do I have that any stores done by InitializeComplexClass() will be visible to the CPU executing the other thread that accesses f? What about the store writing zero into i? Would I have to add a mutex to the class, take a writer lock on it in the constructor and take corresponding reader locks in any methods that accesses the member?
Update: Assume I hand a pointer over to a bunch of other threads once the constructor has returned. I'm not assuming that the code is running on x86, but could be instead running on something like PowerPC, which has a lot of freedom to do memory reordering. I'm essentially interested in what sorts of memory barriers the compiler has to inject into the code when the constructor returns.

In order for the other thread to be able to know about your new object, you have to hand over the object / signal other thread somehow. For signaling a thread you write to memory. Both x86 and x64 perform all memory writes in order, CPU does not reorder these operations with regards to each other. This is called "Total Store Ordering", so CPU write queue works like "first in first out".
Given that you create an object first and then pass it on to another thread, these changes to memory data will also occur in order and the other thread will always see them in the same order. By the time the other thread learns about the new object, the contents of this object was guaranteed to be available for that thread even earlier (if the thread only somehow knew where to look).
In conclusion, you do not have to synchronise anything this time. Handing over the object after it has been initialised is all the synchronisation you need.
Update: On non-TSO architectures you do not have this TSO guarantee. So you need to synchronise. Use MemoryBarrier() macro (or any interlocked operation), or some synchronisation API. Signalling the other thread by corresponding API causes also synchronisation, otherwise it would not be synchronisation API.
x86 and x64 CPU may reorder writes past reads, but that is not relevant here. Just for better understanding - writes can be ordered after reads since writes to memory go through a write queue and flushing that queue may take some time. On the other hand, read cache is always consistent with latest updates from other processors (that have went through their own write queue).
This topic has been made so unbelievably confusing for so many, but in the end there is only a couple of things a x86-x64 programmer has to be worried about:
- First, is the existence of write queue (and one should not at all be worried about read cache!).
- Secondly, concurrent writing and reading in different threads to same variable in case of non-atomic variable length, which may cause data tearing, and for which case you would need synchronisation mechanisms.
- And finally, concurrent updates to same variable from multiple threads, for which we have interlocked operations, or again synchronisation mechanisms.)

If you do :
Foo f;
// HERE: InitializeComplexClass() and "i" member init are guaranteed to be completed
passToOtherThread(&f);
/* From this point, you cannot guarantee the state/members
of 'f' since another thread can modify it */
If you're passing an instance pointer to another thread, you need to implement guards in order for both threads to interact with the same instance. If you ONLY plan to use the instance on the other thread, you do not need to implement guards. However, do not pass a stack pointer like in your example, pass a new instance like this:
passToOtherThread(new Foo());
And make sure to delete it when you are done with it.

Explain race condition in double checked locking

void undefined_behaviour_with_double_checked_locking()
{
if(!resource_ptr) #1
{
std::lock_guard<std::mutex> lk(resource_mutex); #2
if(!resource_ptr) #3
{
resource_ptr.reset(new some_resource); #4
}
}
resource_ptr->do_something(); #5
}
if a thread sees the pointer written by another thread, it might not
see the newly-created instance of some_resource, resulting in the call
to do_something() operating on incorrect values. This is an example of
the type of race condition defined as a data race by the C++ Standard,
and thus specified as undefined behaviour.
Question> I have seen the above explanation for why the code has the double checked locking problem that causes the race condition. However, I still have difficulties to understand what the problem is. Maybe a concrete two-threads step-by-step workflow can help me really understand the race problem for the above the code.
One of the solution mentioned by the book is as follows:
std::shared_ptr<some_resource> resource_ptr;
std::once_flag resource_flag;
void init_resource()
{
resource_ptr.reset(new some_resource);
}
void foo()
{
std::call_once(resource_flag,init_resource); #1
resource_ptr->do_something();
}
#1 This initialization is called exactly once
Any comment is welcome
-Thank you

In this case (depending on the implementation of .reset and !) there may be a problem when Thread 1 gets part-way through initializing resource_ptr and then gets paused/switched. Thread 2 then comes along, performs the first check, sees that the pointer is not null, and skips the lock/fully-initialized check. It then uses the partially-initialized object (probably resulting in bad things happening). Thread 1 then comes back and finishes initializing, but it's too late.
The reason that a partially-initialized resource_ptr is possible is because the CPU is allowed to reorder instructions (as long as it doesn't change single-thread behaviour). So, while the code looks like it should fully-initialize the object and then assign it to resource_ptr, the optimized assembly code might be doing something quite different, and the CPU is also not guaranteed to run the assembly instructions in the order they are specified in the binary!
The takeaway is that when multiple threads are involved, memory fences (locks) are the only way guarantee that things happen in the right order.

The simplest problem scenario is in the case where the intialization of some_resource doesn't depend on resource_ptr. In that case, the compiler is free to assign a value to resource_ptr before it fully constructs some_resource.
For example, if you think of the operation of new some_resource as consisting of two steps:
allocate the memory for some_resource
initialize some_resource (for this discussion, I'm going to make the simplifying assumption that this initialization can't throw an exception)
Then you can see that the compiler could implement the mutex-protected section of code as:
1. allocate memory for `some_resource`
2. store the pointer to the allocated memory in `resource_ptr`
3. initialize `some_resource`
Now it becomes clear that if another thread executes the function between steps 2 and 3, then resource_ptr->do_something() could be called while some_resource has not been initialized.
Note that it's also possible on some processor architectures for this kind of reordering to occur in hardware unless the proper memory barriers are in place (and such barriers would be implemented by the mutex).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js