list::empty() multi-threaded behavior? - c++

I have a list that I want different threads to grab elements from. In order to avoid locking the mutex guarding the list when it's empty, I check empty() before locking.
It's okay if the call to list::empty() isn't right 100% of the time. I only want to avoid crashing or disrupting concurrent list::push() and list::pop() calls.
Am I safe to assume VC++ and Gnu GCC will only sometimes get empty() wrong and nothing worse?
if(list.empty() == false){ // unprotected by mutex, okay if incorrect sometimes
mutex.lock();
if(list.empty() == false){ // check again while locked to be certain
element = list.back();
list.pop_back();
}
mutex.unlock();
}

It's okay if the call to list::empty() isn't right 100% of the time.
No, it is not okay. If you check if the list is empty outside of some synchronization mechanism (locking the mutex) then you have a data race. Having a data race means you have undefined behavior. Having undefined behavior means we can no longer reason about the program and any output you get is "correct".
If you value your sanity, you'll take the performance hit and lock the mutex before checking. That said, the list might not even be the correct container for you. If you can let us know exactly what you are doing with it, we might be able to suggest a better container.
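For what it's worth, here is a minimal sketch of what "lock before checking" could look like. It assumes a std::list<int> guarded by a std::mutex; the names GuardedList, push and try_pop are made up for illustration and do not come from the question.

```cpp
#include <cassert>
#include <list>
#include <mutex>

// Hypothetical wrapper: every access to the list, including empty(),
// happens while the mutex is held.
class GuardedList {
public:
    void push(int value) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(value);
    }

    // Returns true and stores the last element in `out` if the list was
    // non-empty; returns false and leaves `out` untouched otherwise.
    bool try_pop(int& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) {
            return false;
        }
        out = items_.back();
        items_.pop_back();
        return true;
    }

private:
    std::mutex mutex_;
    std::list<int> items_;
};
```

An uncontended lock is usually cheap, so it is probably worth measuring before assuming this is too slow.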

There is a read and a write (most probably to the size member of std::list, if we assume that it's named like that) that are not synchronized with regard to each other. Imagine that one thread calls empty() (in your outer if()) while the other thread has entered the inner if() and executes pop_back(). You are then reading a variable that is, possibly, being modified. This is undefined behaviour.

As an example of how things could go wrong:
A sufficiently smart compiler could see that mutex.lock() cannot possibly change the list.empty() return value and thus skip the inner if check completely, eventually leading to a pop_back on a list that had its last element removed after the first if.
Why can it do that? There is no synchronization in list.empty(), thus if it were changed concurrently that would constitute a data race. The standard says that programs shall not have data races, so the compiler will take that for granted (otherwise it could perform almost no optimizations whatsoever). Hence it can assume a single-threaded perspective on the unsynchronized list.empty() and conclude that it must remain constant.
This is only one of several optimizations (or hardware behaviors) that could break your code.
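If the goal is only to skip the lock when there is probably nothing to take, one option is to keep a separate std::atomic counter next to the list and read that instead of calling empty(). The following is a sketch under assumptions: the class and member names are hypothetical, the counter is only a hint, and the list itself is still touched only while the mutex is held.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>

class HintedList {
public:
    void push(int value) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(value);
        // Written only under the mutex; atomic so it can be read without it.
        approx_size_.store(items_.size(), std::memory_order_relaxed);
    }

    bool try_pop(int& out) {
        // Atomic read: the value may be stale, but it is not a data race.
        if (approx_size_.load(std::memory_order_relaxed) == 0) {
            return false;
        }
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) {  // authoritative check, under the lock
            return false;
        }
        out = items_.back();
        items_.pop_back();
        approx_size_.store(items_.size(), std::memory_order_relaxed);
        return true;
    }

private:
    std::mutex mutex_;
    std::list<int> items_;
    std::atomic<std::size_t> approx_size_{0};
};
```

The unlocked check can still be wrong in either direction for a moment, which is what the question said it could tolerate; the difference is that a stale atomic read is well-defined, while an unsynchronized empty() is not.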

Related

C++ member update visibility inside a critical section when not atomic

I stumbled across the following Code Review StackExchange and decided to read it for practice. In the code, there is the following:
Note: I am not looking for a code review and this is just a copy paste of the code from the link so you can focus in on the issue at hand without the other code interfering. I am not interested in implementing a 'smart pointer', just understanding the memory model:
// Copied from the link provided (all inside a class)
unsigned int count;
mutex m_Mutx;
void deref()
{
m_Mutx.lock();
count--;
m_Mutx.unlock();
if (count == 0)
{
delete rawObj;
count = 0;
}
}
Seeing this makes me immediately think "what if two threads enter when count == 1 and neither sees the updates of the other? Can both end up seeing count as zero and double delete? And is it possible for two threads to cause count to become -1 so that deletion never happens?"
The mutex will make sure one thread enters the critical section, however does this guarantee that all threads will be properly updated? What does the C++ memory model tell me so I can say this is a race condition or not?
I looked at the Memory model cppreference page and std::memory_order cppreference, however the latter page seems to deal with a parameter for atomic. I didn't find the answer I was looking for or maybe I misread it. Can anyone tell me if what I said is wrong or right, and whether or not this code is safe or not?
For correcting the code if it is broken:
Is the correct answer for this to turn count into an atomic member? Or does this work and after releasing the lock on the mutex, all the threads see the value?
I'm also curious if this would be considered the correct answer:
Note: I am not looking for a code review and trying to see if this kind of solution would solve the issue with respect to the C++ memory model.
#include <atomic>
#include <mutex>
struct ClassNameHere {
int* rawObj;
std::atomic<unsigned int> count;
std::mutex mutex;
// ...
void deref()
{
std::scoped_lock lock{mutex};
count--;
if (count == 0)
delete rawObj;
}
};
"what if two threads enter when count == 1" -- if that happens, something else is fishy. The idea behind smart pointers is that the refcount is bound to an object's lifetime (scope). The decrement happens when the object (via stack unrolling) is destroyed. If two threads trigger that, the refcount can not possibly be just 1 unless another bug is present.
However, what could happen is that two threads enter this code when count = 2. In that case, the decrement operation is locked by the mutex, so it can never reach negative values. Again, this assumes non-buggy code elsewhere. Since all this does is to delete the object (and then redundantly set count to zero), nothing bad can happen.
What can happen is a double delete though. If two threads at count = 2 decrement the count, they could both see the count = 0 afterwards. Just determine whether to delete the object inside the mutex as a simple fix. Store that info in a local variable and handle accordingly after releasing the mutex.
Concerning your third question, turning the count into an atomic is not going to fix things magically. Also, the point behind atomics is that you don't need a mutex, because locking a mutex is an expensive operation. With atomics, you can combine operations like decrement and check for zero, which is similar to the fix proposed above. Atomics are typically slower than "normal" integers. They are still faster than a mutex though.
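To illustrate "combine decrement and check for zero", here is a rough sketch using fetch_sub. The struct name RefCounted and the bool return value are assumptions made for the example; this is not the class from the linked review.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical reference-counted holder, for illustration only.
struct RefCounted {
    int* rawObj;
    std::atomic<unsigned int> count;

    RefCounted(int* p, unsigned int initial) : rawObj(p), count(initial) {}

    // Returns true if this call released the last reference.
    bool deref() {
        // fetch_sub returns the value *before* the decrement, so exactly one
        // caller observes 1 and becomes responsible for the delete.
        if (count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete rawObj;
            rawObj = nullptr;
            return true;
        }
        return false;
    }
};
```

Because the decrement and the test are a single atomic operation, no second read of count is needed, and no mutex either.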
In both cases there’s a data race. Thread 1 decrements the counter to 1, and just before the if statement a thread switch occurs. Thread 2 decrements the counter to 0 and then deletes the object. Thread 1 resumes, sees that count is 0, and deletes the object again.
Move the unlock() to the end of the function or, better, use std::lock_guard to do the lock; its destructor will unlock the mutex even when the delete call throws an exception.
If two threads potentially* enter deref() concurrently, then, regardless of the previous or previously expected value of count, a data race occurs, and your entire program, even the parts that you would expect to be chronologically prior, has undefined behavior as stated in the C++ standard in [intro.multithread/20] (N4659):
Two actions are potentially concurrent if
(20.1) they are performed by different threads, or
(20.2) they are unsequenced, at least one is performed by a signal handler, and they are not both performed by the same signal handler invocation.
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.
The potentially concurrent actions in this case, of course, are the read of count outside of the locked section, and the write of count within it.
*) That is, if current inputs allow it.
UPDATE 1: The section you reference, describing atomic memory order, explains how atomic operations synchronize with each other and with other synchronization primitives (such as mutexes and memory barriers). In other words, it describes how atomics can be used for synchronization so that some operations aren't data races. It does not apply here. The standard takes a conservative approach here: Unless other parts of the standard explicitly make clear that two conflicting accesses are not concurrent, you have a data race, and hence UB (where conflicting means same memory location, and at least one of them isn't read-only).
Your lock keeps the operation count-- from being corrupted when several threads perform it concurrently. It does not, however, synchronize the read of count that happens after unlock(): that read sits outside the critical section and can race with a decrement in another thread.
You could rewrite it as follows:
void deref()
{
bool isLast;
m_Mutx.lock();
--count;
isLast = (count == 0);
m_Mutx.unlock();
if (isLast) {
delete rawObj;
}
}
Thereby, the lock makes sure that access to count is synchronized and always in a valid state. This valid state is carried over to the non-critical section through a local variable (without race condition). Thereby, the critical section can be kept rather short.
A simpler version would be to synchronize the complete function body; this can become a disadvantage if you want to do more elaborate things than just delete rawObj:
void deref()
{
std::lock_guard<std::mutex> lock(m_Mutx);
if (! --count) {
delete rawObj;
}
}
BTW: std::atomic alone will not solve this issue, as it synchronizes just each single access, but not a "transaction". Therefore, your scoped_lock is necessary and, since it then spans the complete function, the std::atomic becomes superfluous.

possible concurrent write of *same* value to an integer. Do I need an atomic variable?

I want to introduce an optimization into some legacy code. The optimization boils down to the following simple example:
class Foo{
static int m_count; // allocated and initialized to -1 to indicate it's uninitialized.
void fun(){
if (m_count ==-1)
m_count = execute_db_call(); // return a val > 0.
if (m_count == 1) {
// call special == 1 optimized code.
} else {
// call expensive code.
}
}
};
fun() will be called millions of times on hundreds of threads, all running concurrently on a 256 core server.
execute_db_call is expensive, and the return value is constant for the lifetime of the application.
Do I need or want to make m_count atomic? In the worst case multiple threads might call execute_db_call, and get the same value, and then write this value to the same location in memory. Is that a race condition even when both threads attempt to write the same integer value?
If I did make the member atomic, what kind of performance overhead am I looking at for the subsequent read only behavior?
Per standard §1.10/21:
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
It looks like your code matches this definition, so you will get UB. Now, even assuming your application will never crash (oh well...) you might get unnecessary execute_db_call calls, and you have explicitly stated that "execute_db_call is expensive", so it's still bad.
You haven't said what properties, if any, you need to be assured m_count has after the process has completed, or whether it matters what its value happens to be at any stage during the processing. In the absence of such properties, you might as well not assign it a value at all. If you do need the value of m_count to be anything other than undefined at any point in your processing, you assuredly must use synchronization to ensure that the sequence of actions taken in relation to that variable remains consistent.
If m_count is only written in the -1 case, and the variable is shared by all threads, you would be vastly better off moving that initialization outside of the parallel part of your program. You could thus eliminate any data races and avoid unnecessarily calling your expensive function.
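If moving the initialization is awkward in legacy code, one alternative (swapped in here, not something the answers above proposed) is a function-local static, whose initialization C++11 guarantees to run exactly once even under concurrent calls. In this sketch execute_db_call is a stub, and db_calls, cached_count and run_threads are hypothetical names added only to make the behaviour observable.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<int> db_calls{0};  // counts how often the stub is invoked

// Stand-in for the expensive call from the question.
int execute_db_call() {
    ++db_calls;
    return 1;
}

// The local static is initialized exactly once, even if many threads reach
// this line at the same time; later calls just read the stored value.
int cached_count() {
    static const int count = execute_db_call();
    return count;
}

// Simplified fun(): returns true when the "== 1" fast path would be taken.
bool fun() {
    return cached_count() == 1;
}

// Calls fun() from n threads and returns how many took the fast path.
int run_threads(int n) {
    std::atomic<int> fast{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) {
        threads.emplace_back([&fast] { if (fun()) ++fast; });
    }
    for (auto& t : threads) t.join();
    return fast.load();
}
```

After the first call, each use costs a check of the compiler-generated initialization guard plus a plain read, though the exact overhead depends on the implementation and is worth measuring.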

When will variables be removed by optimization?

I'm working with threads and I have question about how the compilers are allowed to optimize the following code:
void MyClass::f(){
Parent* p = this->m_parent;
this->m_done = true;
p->function();
}
It is very important that p (on the stack or in a register) is used to call the function instead of this->m_parent. Since as soon as m_done becomes true then this may be deleted from another thread if it happens to run its cleanup pass (I have had actual crashes due to this), in which case m_parent could contain garbage, but the thread stack/registers would be intact.
My initial testing on GCC/Linux shows that I don't have a race condition but I would like to know if this will be the case on other compilers as well?
Is this a case for gulp volatile? I've looked at What kinds of optimizations does 'volatile' prevent in C++? and Is 'volatile' needed in this multi-threaded C++ code? but I didn't feel like either of them applied to my problem.
I feel uneasy relying on this as I'm unsure what the compiler is allowed to do here. I see the following cases:
No/Beneficial optimization, the pointer in this->m_parent is stored on the stack/register and this value is later used to call function() this is wanted behavior.
Compiler removes p but this->m_parent is by chance available in a register already and the compiler uses this for the call to function() would work but be unreliable between compilers. This is bad as it could expose bugs moving to a different platform/compiler version.
Compiler removes p and reads this->m_parent before the call to function(), this creates a race condition which I can't have.
Could someone please shed some light on what the compiler is allowed to do here?
Edit
I forgot to mention that this->m_done is a std::atomic<bool> and I'm using C++11.
This code will work perfectly as written in C++11 if m_done is std::atomic<bool>. The read of m_parent is sequenced-before the write to m_done which will synchronize with a hypothetical read of m_done in another thread that is sequenced-before a hypothetical write to m_parent. Taken together, that means that the standard guarantees that this thread's read of m_parent happens-before the other thread's write.
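A small sketch of that ordering, with the memory orders spelled out. The Parent body, the constructor and the helper safe_to_clean_up are assumptions for illustration; the question's plain `m_done = true` is a sequentially consistent store, which is at least as strong as the release store used here.

```cpp
#include <atomic>
#include <cassert>

struct Parent {
    int calls = 0;
    void function() { ++calls; }
};

struct MyClass {
    Parent* m_parent;
    std::atomic<bool> m_done{false};

    explicit MyClass(Parent* parent) : m_parent(parent) {}

    void f() {
        Parent* p = m_parent;  // read m_parent first
        // Release store: the read above cannot be moved after this line.
        m_done.store(true, std::memory_order_release);
        p->function();         // uses the local copy, never this->m_parent
    }
};

// What a hypothetical cleanup thread would check before touching the object.
bool safe_to_clean_up(const MyClass& obj) {
    return obj.m_done.load(std::memory_order_acquire);
}
```

A cleanup thread that sees true from the acquire load is guaranteed that the read of m_parent in f() has already happened, so overwriting or destroying m_parent afterwards does not race with it.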
You might run into a re-ordering problem. Check memory barriers to solve this problem. Place it so that the loading of p and the setting of m_done is done in exactly that order (place it between both instructions) and you should be fine.
Implementations are available in C++11 and boost.
Um, some other thread might delete the object out from under you while you're in one of its member functions? Undefined behavior. Plain and simple.

Dangers of simultaneous write and read of a boolean in a simple situation

I've read some similar questions but the situations described there are a bit more complicated.
I have a bool b initialized as false on the heap and two threads. I do understand that operations with bools are not atomic, but please read the question till the end.
First thread can set b = true only once and doesn't do anything else with it.
Second thread checks b in a loop and if it's true does some actions.
Do I need to use some synchronization mechanism(like mutexes) to protect b?
What can happen if I don't? With ints I can obviously get arbitrary values when I read and write at the same time. But with bools there are just true and false and I don't mind getting false instead of true once. Is it a potential SIGSEGV?
Data races result in undefined behavior. As far as the standard is concerned, a conforming implementation is permitted to segfault.
In practice the main danger is that without synchronization, the compiler will observe enough of the code in the reader loop to judge that b "never changes", and optimize out all but the first read of the value. It can do this because if it observes that there is no synchronization in the loop, then it knows that any write to the value would be a data race. The optimizer is permitted to assume that your program does not provoke undefined behavior, so it is permitted to assume that there are no writes from other threads.
Marking b as volatile will prevent this particular optimization in practice, but even on volatile objects data races are undefined behavior. Calling into code that the optimizer "can't see" will also prevent the optimization in practice, since it doesn't know whether that code modifies b. Of course with link-time/whole-program optimization there is less that the optimizer can't see, than with compile-time-only optimization.
Anyway, preventing the optimization from being made in software doesn't prevent the equivalent thing happening in hardware on a system with non-coherent caches (at least, so I claim: other people argue that this is not correct, and that volatile accesses are required to read/write through caches. Some implementations do behave that way). If you're asking about what the standard says then it doesn't really matter whether or not the hardware shows you a stale cache indefinitely, since behavior remains undefined and so the implementation can break your code regardless of whether this particular optimization is the thing that breaks it.
The problem you might get is that we don't know how long it takes for the reader thread to see the changed value. If they are on different CPUs, with separate caches, there are no guarantees unless you use a memory barrier to synchronize the caches.
On an x86 this is handled automatically by the hardware protocol, but not on some other systems.
Do I need to use some synchronization mechanism(like mutexes) to protect b?
If you don't, you have a data race. Programs with data races have undefined behaviour. The answer to this question is the same as the answer to the question "Do you want your program to have well-defined behaviour?"
What can happen if I don't?
Theoretically, anything can happen. That's what undefined behaviour means. The most likely bad thing that can happen is that the "second thread" may never see a true value.
The compiler can assume that a program has no data races (if it has, the behaviour is not defined by the standard, so behaving as if it didn't is fine). Since the second thread only ever reads from a variable that has the value false, and there's no synchronization that affects those reads, the logical conclusion is that the value never changes, and thus the loop is infinite. (and some infinite loops have undefined behaviour in C++11!)
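For comparison, a minimal sketch of the same two-thread setup with std::atomic<bool>, which removes the data race. The function name run_flag_demo and the actions counter are hypothetical, added so the result can be checked.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns the number of actions the reader performed after seeing the flag.
int run_flag_demo() {
    std::atomic<bool> b{false};
    int actions = 0;

    std::thread reader([&] {
        // Each load is a real atomic read, so the compiler cannot hoist it
        // out of the loop and assume the value never changes.
        while (!b.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        ++actions;  // "does some actions" once the flag is seen
    });

    std::thread writer([&] {
        b.store(true, std::memory_order_release);  // set exactly once
    });

    writer.join();
    reader.join();  // join makes the reader's write to actions visible here
    return actions;
}
```

Plain `b = true` and `while (!b)` would also work on an atomic; they just use the stronger default ordering.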
Here are a few alternative solutions:
Use a Mutex, details have been covered in the other answers above.
Consider using a read/write lock which will manage/protect simultaneous reads and writes. The pthread lib provides an implementation: pthread_rwlock_t
Depending on what your application is doing, consider using a condition variable (pthread lib impl: pthread_cond_t). This is effectively a signal from one thread to another, which could allow you to remove your while loop and bool checking.
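As a rough illustration of the condition-variable option, here is a sketch using std::condition_variable, the C++11 counterpart of pthread_cond_t. The class OneShotSignal and the function run_signal_demo are made-up names, not part of any library.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical one-shot signal from one thread to another.
class OneShotSignal {
public:
    void set() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ready_ = true;
        }
        cv_.notify_all();
    }

    // Blocks until set() has been called; the predicate form copes with
    // spurious wakeups and with set() running before wait().
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return ready_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool ready_ = false;
};

// Starts a waiting thread, signals it, and returns how many actions it ran.
int run_signal_demo() {
    OneShotSignal signal;
    int actions = 0;
    std::thread waiter([&] {
        signal.wait();  // sleeps instead of polling a bool in a loop
        ++actions;
    });
    signal.set();
    waiter.join();
    return actions;
}
```

The waiting thread sleeps until it is notified, so the polling loop and the bare bool both go away.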
Making the boolean volatile will suffice (on x86 architecture), no mutex needed:
volatile bool b;

Is it ok to read a shared boolean flag without locking it when another thread may set it (at most once)?

I would like my thread to shut down more gracefully so I am trying to implement a simple signalling mechanism. I don't think I want a fully event-driven thread so I have a worker with a method to gracefully stop it using a critical section Monitor (equivalent to a C# lock I believe):
DrawingThread.h
class DrawingThread {
bool stopRequested;
Runtime::Monitor CSMonitor;
CPInfo *pPInfo;
//More..
};
DrawingThread.cpp
void DrawingThread::Run() {
if (!stopRequested) {
//Time consuming call#1
}
if (!stopRequested) {
CSMonitor.Enter();
pPInfo = new CPInfo(/**/);
//Not time consuming but pPInfo must either be null or constructed.
CSMonitor.Exit();
}
if (!stopRequested) {
pPInfo->foobar(/**/);//Time consuming and can be signalled
}
if (!stopRequested) {
//One more optional but time consuming call.
}
}
void DrawingThread::RequestStop() {
CSMonitor.Enter();
stopRequested = true;
if (pPInfo) pPInfo->RequestStop();
CSMonitor.Exit();
}
I understand (at least in Windows) Monitor/locks are the least expensive thread synchronization primitive but I am keen to avoid overuse. Should I be wrapping each read of this boolean flag? It is initialized to false and only set once to true when stop is requested (if it is requested before the task completes).
My tutors advised protecting even bools, because reading/writing may not be atomic. I think this one-shot flag is the exception that proves the rule?
It is never OK to read something possibly modified in a different thread without synchronization. What level of synchronization is needed depends on what you are actually reading. For primitive types, you should have a look at atomic reads, e.g. in the form of std::atomic<bool>.
The reason synchronization is always needed is that the processors will have the data possibly shared in a cache line. It has no reason to update this value to a value possibly changed in a different thread if there is no synchronization. Worse, yet, if there is no synchronization it may write the wrong value if something stored close to the value is changed and synchronized.
Boolean assignment is atomic. That's not the problem.
The problem is that a thread may not see changes to a variable made by a different thread, due to either compiler or CPU instruction reordering or data caching (i.e. the thread that reads the boolean flag may read a cached value, instead of the actual updated value).
The solution is a memory fence, which indeed is implicitly added by lock statements, but for a single variable it's overkill. Just declare it as std::atomic<bool>.
The answer, I believe, is "it depends." If you're using C++03, threading isn't defined in the Standard, and you'll have to read what your compiler and your thread library say, although this kind of thing is usually called a "benign race" and is usually OK.
If you're using C++11, benign races are undefined behavior. Even when undefined behavior doesn't make sense for the underlying data type. The problem is that compilers can assume that programs have no undefined behavior, and make optimizations based on that (see also the Part 1 and Part 2 linked from there). For instance, your compiler could decide to read the flag once and cache the value because it's undefined behavior to write to the variable in another thread without some kind of mutex or memory barrier.
Of course, it may well be that your compiler promises to not make that optimization. You'll need to look.
The easiest solution is to use std::atomic<bool> in C++11, or something like Hans Boehm's atomic_ops elsewhere.
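A stripped-down sketch of the one-shot stop flag as a std::atomic<bool>. It assumes the stages can be modelled as loop iterations; Worker, stop_after_stage and the return value are hypothetical, and the Monitor that protects pPInfo in the question is left out, since it would still be needed for that pointer.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical worker; the stages stand in for the time-consuming calls
// in the question.
class Worker {
public:
    // Runs up to three stages and returns how many actually ran.
    int Run() {
        int stages_run = 0;
        for (int stage = 0; stage < 3; ++stage) {
            if (stopRequested.load()) {
                break;  // stop was requested before this stage
            }
            ++stages_run;
            if (stage == stop_after_stage) {
                RequestStop();  // simulate a stop arriving mid-run
            }
        }
        return stages_run;
    }

    // Safe to call from any thread, without taking a lock.
    void RequestStop() { stopRequested.store(true); }

    int stop_after_stage = -1;  // test hook, not part of the original design

private:
    std::atomic<bool> stopRequested{false};
};
```

Reads of the flag need no lock here because every access goes through the atomic.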
No, you have to protect every access, since modern compilers and CPUs reorder code without your multithreading in mind. Unsynchronized reads from different threads might work, but they don't have to.