Is Thread Sanitizer expected to be able to correctly analyze lock-free code? - c++

Background motivation: I have some code that uses a lock-free algorithm to share audio data to/from a CoreAudio callback (only because CoreAudio callback threads are real-time and therefore aren't allowed to lock mutexes). This code seems to work fine, but if I run it under Clang's Thread Sanitizer tool, some race-condition diagnostics are reported.
My question is: to what extent is Thread Sanitizer expected to be able to correctly reason about race conditions in the context of lock-free code? i.e. can it reliably tell the difference between a buggy lock-free algorithm that has a genuine race condition and a correctly-written lock-free algorithm that does not, or is it expected that the Thread Sanitizer will just say "hey, you wrote to this data structure in thread A and later read from it in thread B, and no mutex-locking was ever observed, so I'm going to print a diagnostic about that"?
If Thread Sanitizer is able to correctly analyze lock-free algorithms, any related information about how it does that, and/or how the lock-free algorithm might be tuned/annotated to make Thread Sanitizer's diagnoses more accurate would be appreciated.

As far as I know, Thread Sanitizer should, setting aside bugs, not produce false positives, with the caveats that:
C++ exceptions are not supported,
fences are not properly supported,
all code, including the standard library, needs to be compiled with TSAN instrumentation, and
for synchronization, directly or indirectly, only pthreads primitives and compiler built-in atomics may be used.
Of course, it can only detect data races and similar UB situations and only in execution paths taken. It cannot generally recognize race conditions that result in unintended behavior or evaluate whether a data structure is thread-safe.
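To make the caveats above concrete, here is a minimal sketch of a lock-free handoff that uses only std::atomic operations (no standalone fences) for synchronization. The Slot type and the function names are made up for illustration and are not from the question. As far as I understand, TSan models the happens-before edge created by the release store and the acquire load, so a build with -fsanitize=thread should not report a race on payload here; if the flag were a plain bool, a report would be expected and would be genuine.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical single-slot handoff; all names are illustrative.
struct Slot {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // publication flag
};

// Producer: write the payload, then publish it with a release store.
void produce(Slot& s, int value) {
    s.payload = value;
    s.ready.store(true, std::memory_order_release);
}

// Consumer: spin on an acquire load. Once it observes true, the payload
// write happens-before the read below, so payload is not raced on.
int consume(Slot& s) {
    while (!s.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return s.payload;
}

// Runs one producer and one consumer; returns what the consumer saw.
int run_handoff(int value) {
    Slot s;
    int seen = 0;
    std::thread consumer([&] { seen = consume(s); });
    std::thread producer([&] { produce(s, value); });
    producer.join();
    consumer.join();  // join makes the write to seen visible here
    assert(seen == value);
    return seen;
}
```

Calling run_handoff(42) from main in a TSan build is one way to check what the tool says about this pattern on your toolchain.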

Related

Is C++ std::atomic compatible with pthreads?

I have 2 pthread threads where one is writing a bool value and another is reading it.
I don't care about portability; it's the x86 architecture. The only thing that concerns me is the case where the writing thread sets the bool to true and starts doing its own work (which happens once a day at midnight), closing a file, while the other thread has read the bool as false and proceeds with its work (writing to a file) at the same time. It's very difficult to reproduce this scenario, so I had better get the best possible theoretical solution.
Can I use std::atomic in case of pthreads?
Can I use std::atomic in case of pthreads?
Yes, that's what std::atomic is for.
It works with std::thread, POSIX threads, and any other kind of threads. Behind the scenes it uses "magical" compiler annotations to prevent certain thread-incompatible optimizations, and processor-specific locking instructions to guarantee that thread-safe code is generated [1].
It makes (almost) no sense to use std::atomic without threads (you could use std::atomic instead of volatile for signal handlers, but there is no advantage in doing so).
The only thing that concerns me ...
The rest of your question makes no sense to me.
[1] When used correctly, which is often a non-trivial thing to do, and which is why you generally should try not to use std::atomic unless you are an expert.
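For illustration, a minimal sketch of std::atomic shared between two raw pthreads follows. The flag name and the thread functions are hypothetical. Note that an atomic flag only makes the read and the write of the bool itself well defined; it does not by itself stop the reader from seeing false just before the writer sets it, so the file access in the question probably needs a mutex held across both the check and the write.

```cpp
#include <atomic>
#include <cassert>
#include <pthread.h>

// Hypothetical flag: set by the "midnight" thread before it closes the file.
static std::atomic<bool> closing{false};

// Writer thread entry point (pthread signature).
static void* writer_main(void*) {
    closing.store(true);  // sequentially consistent by default
    return nullptr;
}

// Reader thread entry point: reports through arg once it saw the flag.
static void* reader_main(void* arg) {
    while (!closing.load()) {
        // spin until the writer sets the flag
    }
    *static_cast<bool*>(arg) = true;
    return nullptr;
}

// Starts both threads, waits for them, and returns whether the reader
// observed the flag.
bool run_flag_demo() {
    closing.store(false);
    bool seen = false;
    pthread_t writer, reader;
    int rc = pthread_create(&reader, nullptr, reader_main, &seen);
    assert(rc == 0);
    rc = pthread_create(&writer, nullptr, writer_main, nullptr);
    assert(rc == 0);
    (void)rc;
    pthread_join(writer, nullptr);
    pthread_join(reader, nullptr);  // join makes the write to seen visible
    return seen;
}
```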

How compiler like GCC implement acquire/release semantics for std::mutex

My understanding is that std::mutex lock and unlock have acquire/release semantics, which prevent instructions between them from being moved outside.
So acquire/release should disable both compiler and CPU reordering of instructions.
My question is that I took a look at the GCC 5.1 code base and don't see anything special in std::mutex::lock/unlock to prevent the compiler from reordering code.
I found a potential answer in does-pthread-mutex-lock-have-happens-before-semantics, which points to a mail saying that an external function call acts as a compiler memory fence.
Is that always true? And where is it in the standard?
Threads are a fairly complicated, low-level feature. Historically, there was no standard C thread functionality, and instead it was done differently on different OS's. Today there is mainly the POSIX threads standard, which has been implemented in Linux and BSD, and now by extension OS X, and there are Windows threads, starting with Win32 and on. Potentially, there could be other systems besides these.
GCC doesn't directly contain a POSIX threads implementation; instead it may be a client of libpthread on a Linux system. When you build GCC from source, you have to configure and build separately a number of ancillary libraries, supporting things like big numbers and threads. That is the point at which you select how threading will be done. If you do it the standard way on Linux, you will have an implementation of std::thread in terms of pthreads.
On Windows, starting with MSVC C++11 compliance, the MSVC devs implemented std::thread in terms of the native Windows threads interface.
It's the OS's job to ensure that the concurrency locks provided by its API actually work -- std::thread is meant to be a cross-platform interface to such a primitive.
The situation may be more complicated for more exotic platforms / cross-compiling etc. For instance, in the MinGW project (gcc for Windows), historically you have had the option to build MinGW gcc using either a port of pthreads to Windows or a native Win32-based threading model. If you don't configure this when you build, you may end up with a C++11 compiler which doesn't support std::thread or std::mutex. See this question for more details: MinGW error: ‘thread’ is not a member of ‘std’
Now, to answer your question more directly. When a mutex is engaged, at the lowest level, this involves some call into libpthread or some Win32 API.
pthread_mutex_lock(&mutex);
do_some_stuff();
pthread_mutex_unlock(&mutex);
(The pthread_mutex_lock and pthread_mutex_unlock calls correspond to the implementations of lock and unlock of std::mutex on your platform, and in idiomatic C++11 code these are in turn called in the ctor and dtor of std::unique_lock, for instance, if you are using that.)
Generally, the optimizer cannot reorder these unless it is sure that pthread_mutex_lock() has no side effects that can change the observable behavior of do_some_stuff().
To my knowledge, the mechanism the compiler has for doing this is ultimately the same as what it uses for estimating the potential side-effects of calls to any other external library.
If there is some resource
int resource;
which is in contention among various threads, it means that there is some function body
void compete_for_resource();
and a function pointer to this is at some earlier point passed to pthread_create... in your program in order to initiate another thread. (This would presumably be in the implementation of the ctor of std::thread.) At this point, the compiler can see that any call into libpthread can potentially call compete_for_resource and touch any memory that that function touches. (From the compiler's point of view libpthread is a black box -- it is some .dll / .so and it can't make assumptions about what exactly it does.)
In particular, the call pthread_mutex_lock(); potentially has side effects for resource, so it cannot be reordered against do_some_stuff().
If you never actually spawn any other threads, then to my knowledge, do_some_stuff(); could be reordered outside of the mutex lock. Since, then libpthread doesn't have any access to resource, it's just a private variable in your source and isn't shared with the external library even indirectly, and the compiler can see that.
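In C++11 terms, the lock / do_some_stuff / unlock pattern discussed above might look like the following sketch. It reuses the resource and compete_for_resource names from the text; run_contention and the iteration count are additions for illustration, and std::lock_guard is used here where the answer mentions std::unique_lock.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

static int resource = 0;           // shared, non-atomic
static std::mutex resource_mutex;  // guards resource

// Body run by each competing thread.
void compete_for_resource(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> guard(resource_mutex);  // lock() in ctor
        ++resource;                                         // do_some_stuff()
    }                                                       // unlock() in dtor
}

// Two threads increment the same counter; with the mutex in place no
// increment is lost, so the total is exactly 2 * iterations.
int run_contention(int iterations) {
    resource = 0;
    std::thread a(compete_for_resource, iterations);
    std::thread b(compete_for_resource, iterations);
    a.join();
    b.join();
    assert(resource == 2 * iterations);
    return resource;
}
```

If the increment could be moved outside the locked region, by either the compiler or the CPU, the final count could come up short, which is exactly what the acquire/release semantics of lock and unlock rule out.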
All of these questions stem from the rules for compiler reordering. One of the fundamental rules for reordering is that the compiler must prove that the reorder does not change the result of the program. In the case of std::mutex, the exact meaning of that phrase is specified in a block of about 10 pages of legalese, but the general intuitive sense of "doesn't change the result of the program" holds. If you had a guarantee about which operation came first, according to the specification, no compiler is allowed to reorder in a way which violates that guarantee.
This is why people often claim that a "function call acts as a memory barrier." If the compiler cannot deep-inspect the function, it cannot prove that the function didn't have a hidden barrier or atomic operation inside of it, thus it must treat that function as though it was a barrier.
There is, of course, the case where the compiler can inspect the function, such as the case of inline functions or link time optimizations. In these cases, one cannot rely on a function call to act as a barrier, because the compiler may indeed have enough information to prove the rewrite behaves the same as the original.
In the case of mutexes, even such advanced optimization cannot take place. The only way to reorder around the mutex lock/unlock function calls is to have deep-inspected the functions and proven there are no barriers or atomic operations to deal with. If it can't inspect every sub-call and sub-sub-call of that lock/unlock function, it can't prove it is safe to reorder. If it indeed can do this inspection, it would see that every mutex implementation contains something which cannot be reordered around (indeed, this is part of the definition of a valid mutex implementation). Thus, even in that extreme case, the compiler is still forbidden from optimizing.
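To show what "something which cannot be reordered around" can look like, here is a toy spinlock built on std::atomic_flag. This is only a sketch for illustration and not how pthreads or std::mutex is actually implemented; the SpinLock class and run_spinlock_demo are invented names. Even if the compiler fully inlines lock() and unlock(), the acquire and release orderings on the flag forbid moving the protected increment out of the critical section.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Toy lock: test_and_set with acquire ordering on entry, clear with
// release ordering on exit.
class SpinLock {
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();  // be polite while waiting
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Two threads increment a plain int under the spinlock.
int run_spinlock_demo(int iterations) {
    SpinLock lock;
    int counter = 0;
    auto work = [&] {
        for (int i = 0; i < iterations; ++i) {
            lock.lock();
            ++counter;  // protected by the acquire/release pair
            lock.unlock();
        }
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    assert(counter == 2 * iterations);
    return counter;
}
```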
EDIT: For completeness, I would like to point out that these rules were introduced in C++11. C++98 and C++03 reordering rules only prohibited changes that affected the result of the current thread. Such a guarantee is not strong enough to develop multithreading primitives like mutexes.
To deal with this, multithreading APIs like pthreads developed their own rules. From the Pthreads specification, section 4.11:
Applications shall ensure that access to any memory location by more
than one thread of control (threads or processes) is restricted such
that no thread of control can read or modify a memory location while
another thread of control may be modifying it. Such access is
restricted using functions that synchronize thread execution and also
synchronize memory with respect to other threads. The following
functions synchronize memory with respect to other threads
It then lists a few dozen functions which synchronize memory, including pthread_mutex_lock and pthread_mutex_unlock.
A compiler which wishes to support the pthreads library must implement something to support this cross-thread memory synchronization, even though the C++ specification didn't say anything about it. Fortunately, any compiler where you want to do multithreading was developed with the recognition that such guarantees are fundamental to all multithreading, so every compiler that supports multithreading has it!
In the case of gcc, it did so without any special notes on the pthreads function calls because gcc would effectively create a barrier around every external function call (because it couldn't prove that no synchronization existed inside that function call). If gcc were to ever change that, they would also have to change their pthreads headers to include any extra verbiage needed to mark the pthreads functions as synchronizing memory.
All of that, of course, is compiler specific. There were no standards answers to this question until C++11 came along with its new memory model.
NOTE: I am no expert in this area and my knowledge about it is in a spaghetti like condition. So take the answer with a grain of salt.
NOTE-2: This might not be the answer that OP is expecting. But here are my 2 cents anyways if it helps:
My question is that I took a look at the GCC 5.1 code base and don't see
anything special in std::mutex::lock/unlock to prevent the compiler
from reordering code.
g++ uses the pthread library. std::mutex is just a thin wrapper around pthread_mutex, so you will have to actually go and look at pthread's mutex implementation.
If you go a bit deeper into the pthread implementation (which you can find here), you will see that it uses atomic instructions along with futex calls.
Three minor things to remember here:
1. The atomic instructions do use barriers.
2. Any function call is equivalent to a full barrier. I do not remember where I read this.
3. Mutex calls may put the thread to sleep and cause a context switch.
Now, as far as reordering goes, one of the things that needs to be guaranteed is that no instruction after lock and before unlock is reordered to before lock or after unlock. This, I believe, is not a full barrier, but rather just an acquire and a release barrier respectively. But this is again platform dependent: x86 provides a fairly strong ordering by default (close to, but not quite, sequential consistency), whereas ARM provides a weaker ordering guarantee.
I strongly recommend this blog series:
http://preshing.com/archives/
It explains lots of lower level stuff in easy to understand language. Guess, I have to read it once again :)
UPDATE: Unable to comment on @Cort Ammon's answer due to length
@Kane I am not sure about this, but people in general write barriers at the processor level, which takes care of compiler-level barriers as well. The same is not true for compiler built-in barriers.
Now, since the pthread_*lock* function definitions are not present in the translation unit where you are making use of them (this is doubtful), calling lock - unlock should provide you with a full memory barrier. The pthread implementation for the platform makes use of atomic instructions to block any other thread from accessing the memory locations after the lock or before unlock. Now, since only one thread is executing the critical portion of the code, it is ensured that any reordering within that portion will not change the expected behaviour, as mentioned in the above comment.
Atomics is pretty tough to understand and to get right, so, what I have written above is from my understanding. Would be very glad to know if my understanding is wrong here.
So acquire/release should disable both compiler and CPU reordering of instructions.
By definition anything that prevents CPU reordering by speculative execution prevents compiler reordering. That's the definition of language semantics, even without MT (multi-threading) in the language, so you will be safe from reordering on old compilers that don't support MT.
But these compilers aren't safe for MT for a bunch of reasons, from the lack of thread protection around runtime initialization of static variables to the implicitly modified global variables like errno, etc.
Also, in C/C++, any call to a function that is purely external (that is: not inline and not available for inlining at any point), without an annotation explaining what it does (like the "pure function" attribute of some popular compilers), must be assumed to do anything that legal C/C++ code can do. No non-trivial reordering would be possible (any reordering that is visible is non-trivial).
Any correct implementation of locks on systems with multiple units of execution that don't simulate a global order on assembly instructions will require memory barriers and will prevent reordering.
An implementation of locks on a linearly executing CPU, with only one unit of execution (or where all threads are bound on the same unit of execution), might use only volatile variables for synchronisation and that is unsafe as volatile reads resp. writes do not provide any guarantee of acquire resp. release of any other data (contrast Java). Some kind of compiler barrier would be needed, like a strongly external function call, or some asm (""/*nothing*/) (which is compiler specific and even compiler version specific).

Is the meaning of "lock-free" even defined by the C++ standard?

I can't find a semantic difference between lock-based and lock-free atomics. So far as I can tell, the difference is semantically meaningless as far as the language is concerned, since the language doesn't provide any timing guarantees. The only guarantees I can find are memory ordering guarantees, which seem to be the same for both cases.
(How) can the lock-free-ness of atomics affect program semantics?
i.e., aside from calling is_lock_free or atomic_is_lock_free, is it possible to write a well-defined program whose behavior is actually affected by whether atomics are lock-free?
Do those functions even have a semantic meaning? Or are they just practical hacks for writing responsive programs even though the language never provides timing guarantees in the first place?
There is at least one semantic difference.
As per C++11 1.9 Program execution /6:
When the processing of the abstract machine is interrupted by receipt of a signal, the values of objects which are neither of type volatile std::sig_atomic_t nor lock-free atomic objects are unspecified during the execution of the signal handler, and the value of any object not in either of these two categories that is modified by the handler becomes undefined.
In other words, it's safe to muck about with those two categories of variables but any access or modification to all other categories should be avoided.
Of course you could argue that it's no longer a well defined program if you invoke unspecified/undefined behaviour like that but I'm not entirely sure whether you meant that or well-formed (i.e., compilable).
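As a small sketch of the two categories named in that paragraph, the handler below touches only a lock-free std::atomic<int> and a volatile std::sig_atomic_t. The variable and function names are made up, SIGINT is chosen only because it is available in standard C++, and the static_assert encodes the assumption that int atomics are always lock-free on the target.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

// Assumption of this sketch: std::atomic<int> is always lock-free here.
static_assert(ATOMIC_INT_LOCK_FREE == 2,
              "sketch assumes int atomics are always lock-free");

static std::atomic<int> signal_count{0};            // lock-free atomic object
static volatile std::sig_atomic_t last_signal = 0;  // the other allowed kind

// The handler only touches the two kinds of objects listed in the quote.
extern "C" void on_signal(int signum) {
    signal_count.fetch_add(1, std::memory_order_relaxed);
    last_signal = signum;
}

// Installs the handler, raises SIGINT once in this thread, restores the
// previous handler, and returns how many signals were counted.
int run_signal_demo() {
    signal_count.store(0);
    auto previous = std::signal(SIGINT, on_signal);
    assert(previous != SIG_ERR);
    std::raise(SIGINT);  // the handler runs before raise returns
    std::signal(SIGINT, previous);
    return signal_count.load();
}
```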
But, even if you discount that semantic difference, a performance difference is worth having. If I had to have a value for communicating between threads, I'd probably choose, in order of preference:
1. the smallest adequate data type that was lock-free.
2. a larger data type than necessary, if it was lock-free and the smaller one wasn't.
3. a shared region fully capable of race conditions, but in conjunction with an atomic_flag (guaranteed to be lock-free) to control access.
This behaviour could be selected at compile or run time based on the ATOMIC_x_LOCK_FREE macros so that, even though the program behaves the same regardless, the optimal method for that behaviour is chosen.
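A compile-time version of that selection could be sketched as follows. The SharedValue alias, the FlagGuardedInt fallback class, and round_trip are hypothetical names; the ATOMIC_INT_LOCK_FREE and ATOMIC_LLONG_LOCK_FREE macros are the standard ones, where a value of 2 means "always lock-free".

```cpp
#include <atomic>
#include <cassert>

// Fallback (third preference): a plain int guarded by std::atomic_flag,
// which is the one atomic type guaranteed to be lock-free.
class FlagGuardedInt {
public:
    void store(int v) { lock(); value_ = v; unlock(); }
    int load() { lock(); int v = value_; unlock(); return v; }

private:
    void lock() { while (busy_.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { busy_.clear(std::memory_order_release); }
    std::atomic_flag busy_ = ATOMIC_FLAG_INIT;
    int value_ = 0;
};

// Pick the first option whose atomics are always lock-free.
#if ATOMIC_INT_LOCK_FREE == 2
using SharedValue = std::atomic<int>;        // smallest adequate type
#elif ATOMIC_LLONG_LOCK_FREE == 2
using SharedValue = std::atomic<long long>;  // larger than necessary
#else
using SharedValue = FlagGuardedInt;          // flag-guarded shared region
#endif

// The program behaves the same whichever branch was selected.
int round_trip(int v) {
    SharedValue shared;
    shared.store(v);
    return static_cast<int>(shared.load());
}
```

For a run-time decision, std::atomic<T>::is_lock_free() can be queried instead, since a macro value of 1 only says "sometimes lock-free".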
In the C++11 Standard, the term "lock-free" was not well defined, as reported in LWG issue #2075.
The C++14 Standard defines what a lock-free execution is in the C++ language (N3927 was approved).
Quote C++14 1.10[intro.multithread]/paragraph 4:
Executions of atomic functions that are either defined to be lock-free (29.7) or indicated as lock-free (29.4) are lock-free executions.
If there is only one unblocked thread, a lock-free execution in that thread shall complete. [ Note: Concurrently executing threads may prevent progress of a lock-free execution. For example, this situation can occur with load-locked store-conditional implementations. This property is sometimes termed obstruction-free. -- end note ]
When one or more lock-free executions run concurrently, at least one should complete. [ Note: It is difficult for some implementations to provide absolute guarantees to this effect, since repeated and particularly inopportune interference from other threads may prevent forward progress, e.g., by repeatedly stealing a cache line for unrelated purposes between load-locked and store-conditional instructions. Implementations should ensure that such effects cannot indefinitely delay progress under expected operating conditions, and that such anomalies can therefore safely be ignored by programmers. Outside this International Standard, this property is sometimes termed lock-free. -- end note ]
The above definition of "lock-free" depends on how an unblocked thread behaves. The C++ Standard does not define an unblocked thread directly, but 17.3.3 [defns.blocked] defines a blocked thread:
a thread that is waiting for some condition (other than the availability of a processor) to be satisfied before it can continue execution
(How) can the lock-free-ness of atomics affect program semantics?
I think the answer is NO, except for signal handlers as in paxdiablo's answer, when "program semantics" means the side effects of atomic operations.
The lock-free-ness of atomics affects the strength of the progress guarantee for the whole multithreaded program.
When two (or more) threads concurrently execute lock-free atomic operations on the same object, at least one of these operations should complete, even under the worst thread scheduling.
In other words, an 'evil' thread scheduler could, in theory, intentionally block the progress of lock-based atomic operations.
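The progress property can be seen in an ordinary compare-and-swap loop, sketched below under the assumption that std::atomic<int> is lock-free on the target. The update_max and run_max_demo names are illustrative. Leaving spurious failures of compare_exchange_weak aside, an attempt only fails because another thread's update to the same object went through, so some thread always makes progress, and no thread can stall the others by being descheduled at a bad moment.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Atomically raise target to at least candidate.
void update_max(std::atomic<int>& target, int candidate) {
    int current = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads current with the value
    // that is now stored, and the loop re-checks whether an update is
    // still needed.
    while (current < candidate &&
           !target.compare_exchange_weak(current, candidate,
                                         std::memory_order_relaxed)) {
    }
}

// Each thread proposes a different value; the largest one must win.
int run_max_demo(int thread_count) {
    std::atomic<int> maximum{0};
    std::vector<std::thread> threads;
    for (int id = 1; id <= thread_count; ++id) {
        threads.emplace_back([&maximum, id] { update_max(maximum, id * 10); });
    }
    for (auto& t : threads) {
        t.join();
    }
    return maximum.load();
}
```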
Paxdiablo has answered pretty well, but some background might help.
"Lock-free atomic" is a bit of redundant terminology. The point of atomic variables, as they were originally invented, is to avoid locks by leveraging hardware guarantees. But, each platform has its own limitations, and C++ is highly portable. So the implementation has to emulate atomicity (usually via the library) using fine-grained locks for atomic types that don't really exist at the hardware level.
Behavioral differences are minimized between hardware atomics and "software atomics," because differences would mean lost portability. On the other hand, a program should be able to avoid accidentally using mutexes, hence introspection through ATOMIC_x_LOCK_FREE which is available to the preprocessor.

Do I need Read-Write lock here

I am writing multi-threaded code. I am not sure whether I need a read-write lock mechanism. Could you please go through the use case and tell me whether I have to use a read-write lock or whether a normal mutex will do?
Use case:
1) Class having two variables. These are accessed by every thread before doing operation.
2) When something goes wrong, these variables are updated to reflect the error scenarios.
Thus threads reading these variables can take different decisions (including abort)
Here, in the second point, I need to update the data, and in the first point, every thread will use the data. So my question is: do I have to use a write lock while updating the data and a read lock while reading it? (Note: both variables are in memory, just a boolean flag and a string.)
I am confused because both of my variables are in memory. Does the OS take care of their integrity? I mean I can live with 1 or 2 threads missing the updated value when some thread is writing the data in mutex.
Please tell me if I am right or wrong. Also, please tell me whether I have to use a read-write lock or whether a normal mutex would do.
Update: I am sorry that I did not give the platform and compiler name. I am on RHEL 5.0 and using gcc 4.6. My platform is x86_64. But I do not want my code to be OS specific because we are going to port the code shortly to Solaris 10.
First off, ignore those other answerers talking about volatile. Volatile is almost useless for multithreaded programming, and any false sense of safety given by it is just that - false.
Now, whether you need a lock depends on what you're doing with these variables. You will need a memory barrier at least (locks imply one).
So let's give an example:
One flag is an error flag. If zero, you continue, otherwise, you abort.
Another flag is a diagnostic code flag. It gives the precise reason for the error.
In this case, one option would be to do the following:
Read the error flag without a lock, but with read memory barriers after the read.
When an error occurs, take a lock, set the diagnostic code and error flags, then release the lock. If the diagnostic code is already set, release the lock immediately.
The memory barriers are needed, as otherwise the compiler (or CPU!) may choose to cache the same result for every read.
Of course, if the semantics of your two variables are different, the answer may vary. You'll need to be more specific.
Note that the exact mechanism for specifying locks and memory barriers depends on the compiler. C++0x provides a portable mechanism, but few compilers fully implement the C++0x standard yet. Please specify your compiler and OS for a more detailed answer.
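With a compiler that implements the C++11 atomics, the flag-plus-diagnostic scheme described above could be sketched like this. The ErrorState class and its member names are invented for illustration, and the diagnostic is a std::string because the question mentions a boolean flag and a string. The acquire load in failed() plays the role of the "read memory barrier", and the first reported error wins.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <string>

class ErrorState {
public:
    // Fast path: no lock, just an acquire load of the error flag.
    bool failed() const { return failed_.load(std::memory_order_acquire); }

    // Slow path: take the lock, set the diagnostic and then the flag.
    // If an error was already recorded, release the lock immediately.
    void report(const std::string& diagnostic) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (failed_.load(std::memory_order_relaxed)) {
            return;
        }
        diagnostic_ = diagnostic;
        failed_.store(true, std::memory_order_release);
    }

    // Reading the string takes the lock, since std::string is not atomic.
    std::string diagnostic() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return diagnostic_;
    }

private:
    mutable std::mutex mutex_;
    std::atomic<bool> failed_{false};
    std::string diagnostic_;
};

// Single-threaded walk through the intended behaviour.
std::string run_error_demo() {
    ErrorState state;
    assert(!state.failed());
    state.report("disk full");
    state.report("second error, ignored");
    assert(state.failed());
    return state.diagnostic();
}
```

Worker threads would call failed() before each operation and abort (or read diagnostic()) when it returns true.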
As for your output data, you will almost certainly need a lock there. Try to avoid taking these locks too often though, as too much lock contention will kill your performance.
If they are atomic variables (C1x stdatomic.h or C++0x atomic), then you don't need read/write locks. With earlier C/C++ standards, there's no portable way of using multiple threads at all, so you need to look into how the implementation you are using does things. In most implementations, data types that can be accessed with a single machine instruction are atomic.
Note that just having an atomic variable is not enough -- you probably also need to declare it as volatile to guarantee that the compiler does not do things that will cause you to miss updates from other threads.
Thus threads reading these variables can take different decisions (including abort)
So each thread needs to ensure that it reads the updated data. Also, since the variables are shared, you need to take care of the race condition as well.
So in short - you need to use read/write locks when reading and writing to these shared variables.
See if you can use volatile variables - that should save you from using locks when you read the values (however write should still be with locks). This is applicable only because you said that -
I mean I can live with 1 or 2 threads missing the updated value when some thread is writing the data in mutex

Thread Safety Testing

I currently use Google's gtest to write my unit tests, but it doesn't look like it can test thread-safety (that is, accessing something from multiple threads and ensuring it behaves according to spec).
What do you use to test thread safety? I'd like something cross-platform, but it definitely has to work on Windows at least.
Thank you!
If multithreaded code isn't immediately, obviously, provably correct then it is almost certainly wrong. And if it is, you don't need to test it.
Seriously: shared mutable state should be extremely localised and rare, and the classes that do it should be demonstrably correct.
Your threads should normally interact via safe primitives (eg a thread-safe work queue). If you have lots of data structures scattered around your code each with its own locking strategy then your code almost certainly contains deadlocks and race conditions. A big testing effort will only find some of the problems.
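As an example of such a "safe primitive", a minimal thread-safe work queue might look like the sketch below. The WorkQueue class and run_queue_demo are illustrative names, and the design (one mutex plus one condition variable, unbounded, no shutdown handling) is an assumption chosen for brevity rather than a recommendation for production use.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// All shared mutable state lives inside this one small class.
template <typename T>
class WorkQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            items_.push(std::move(item));
        }
        ready_.notify_one();  // wake one waiting consumer
    }

    // Blocks until an item is available, then removes and returns it.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
};

// One producer pushes 1..count, one consumer sums what it pops.
long run_queue_demo(int count) {
    WorkQueue<int> queue;
    long sum = 0;
    std::thread consumer([&] {
        for (int i = 0; i < count; ++i) {
            sum += queue.pop();
        }
    });
    for (int i = 1; i <= count; ++i) {
        queue.push(i);
    }
    consumer.join();  // join makes the final sum visible here
    return sum;
}
```

Because the locking is confined to two short member functions, the class is small enough to review by inspection, which is the point the answer is making.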