I'd like to minimize synchronization and write lock-free code when possible in a project of mine. When absolutely necessary I'd love to substitute light-weight spinlocks built from atomic operations for pthread and win32 mutex locks. My understanding is that these are system calls underneath and could cause a context switch (which may be unnecessary for very quick critical sections where simply spinning a few times would be preferable).
The atomic operations I'm referring to are well documented here: http://gcc.gnu.org/onlinedocs/gcc-4.4.1/gcc/Atomic-Builtins.html
Here is an example to illustrate what I'm talking about. Imagine a RB-tree with multiple readers and writers possible. RBTree::exists() is read-only and thread safe, RBTree::insert() would require exclusive access by a single writer (and no readers) to be safe. Some code:
class IntSetTest
{
private:
    unsigned short lock;
    RBTree<int>* myset;

public:
    // ...

    void add_number(int n)
    {
        // Acquire once lock==0 (atomic)
        while (__sync_bool_compare_and_swap(&lock, 0, 0xffff) == false);
        // Perform a thread-unsafe operation on the set
        myset->insert(n);
        // Unlock (atomic)
        __sync_bool_compare_and_swap(&lock, 0xffff, 0);
    }

    bool check_number(int n)
    {
        // Increment once the lock is below 0xffff
        unsigned short savedlock = lock;
        while (savedlock == 0xffff || __sync_bool_compare_and_swap(&lock, savedlock, savedlock+1) == false)
            savedlock = lock;
        // Perform read-only operation
        bool exists = myset->exists(n);
        // Decrement
        savedlock = lock;
        while (__sync_bool_compare_and_swap(&lock, savedlock, savedlock-1) == false)
            savedlock = lock;
        return exists;
    }
};
(let's assume it need not be exception-safe)
Is this code indeed thread-safe? Are there any pros/cons to this idea? Any advice? Is the use of spinlocks like this a bad idea if the threads are not truly concurrent?
Thanks in advance. ;)
You need a volatile qualifier on lock, and I would also make it a sig_atomic_t. Without the volatile qualifier, this code:
unsigned short savedlock = lock;
while (savedlock == 0xffff || __sync_bool_compare_and_swap(&lock, savedlock, savedlock+1) == false)
    savedlock = lock;
may not re-read lock when updating savedlock in the body of the while-loop. Consider the case that lock is 0xffff. Then, savedlock will be 0xffff prior to checking the loop condition, so the while condition will short-circuit prior to calling __sync_bool_compare_and_swap. Since __sync_bool_compare_and_swap wasn't called, the compiler doesn't encounter a memory barrier, so it might reasonably assume that the value of lock hasn't changed underneath you, and avoid re-loading it in savedlock.
Re: sig_atomic_t, there's a decent discussion here. The same considerations that apply to signal handlers would also apply to threads.
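For concreteness, a minimal sketch of the suggested declaration change (the rest of the class stays as in the question):

#include <signal.h>

class IntSetTest
{
private:
    // volatile forces a fresh load on every read; sig_atomic_t is a type
    // the platform guarantees can be read and written atomically
    volatile sig_atomic_t lock;
    RBTree<int>* myset;
    // ...
};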
With these changes, I'd guess that your code would be thread-safe. I would still recommend using mutexes, though, since you really don't know how long your RB-tree insert will take in the general case (per my previous comments under the question).
It may be worth noting that, if you're using Win32 mutexes, a thread pool is provided for you from Vista onwards. Depending on what you use the RB tree for, you could replace your approach with that.
Also, remember that atomic operations are not particularly fast. Microsoft has said they cost a couple of hundred cycles each.
Rather than trying to "protect" the function in this way, it would likely be much more efficient to simply synchronize the threads, either changing to a SIMD/thread pool approach, or to just use a mutex.
But, of course, without seeing your code, I can't really make any more comments. The trouble with multithreading is that you have to see someone's whole model to understand it.
Related
I'd like to write a function that is accessible only by a single thread at a time. I don't need busy waits, a brutal 'rejection' is enough if another thread is already running it. This is what I have come up with so far:
std::atomic<bool> m_busy(false);

bool func()
{
    if (m_busy.exchange(true) == true)
        return false;

    // ... do stuff ...

    m_busy.exchange(false);
    return true;
}
Is the logic for the atomic exchange correct?
Is it correct to mark the two atomic operations as std::memory_order_acq_rel? As far as I understand a relaxed ordering (std::memory_order_relaxed) wouldn't be enough to prevent reordering.
Your atomic swap implementation might work. But trying to do thread-safe programming without a lock is almost always fraught with issues, and is often harder to maintain.
Unless a performance improvement is needed, std::mutex with the try_lock() method is all you need, e.g.:
std::mutex mtx;

bool func()
{
    // making use of std::unique_lock so if the code throws an
    // exception, the std::mutex will still get unlocked correctly...
    std::unique_lock<std::mutex> lck(mtx, std::try_to_lock);
    bool gotLock = lck.owns_lock();
    if (gotLock)
    {
        // do stuff
    }
    return gotLock;
}
Your code looks correct to me, as long as you leave the critical section by falling out, not returning or throwing an exception.
You can unlock with a release store; an RMW (like exchange) is unnecessary. The initial exchange only needs acquire. (But does need to be an atomic RMW like exchange or compare_exchange_strong)
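Applied to the code from the question, a minimal sketch of that would be:

#include <atomic>

std::atomic<bool> m_busy(false);

bool func()
{
    // acquire is sufficient for taking the lock, but it must stay an RMW
    if (m_busy.exchange(true, std::memory_order_acquire))
        return false;

    // ... do stuff ...

    // a plain release store is enough to unlock; no RMW needed
    m_busy.store(false, std::memory_order_release);
    return true;
}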
Note that ISO C++ says that taking a std::mutex is an "acquire" operation, and releasing it is a "release" operation, because that's the minimum necessary for keeping the critical section contained between the taking and the releasing.
Your algo is exactly like a spinlock, but without retry if the lock's already taken. (i.e. just a try_lock). All the reasoning about necessary memory-order for locking applies here, too. What you've implemented is logically equivalent to the try_lock / unlock in #selbie's answer, and very likely performance-equivalent, too. If you never use mtx.lock() or whatever, you're never actually blocking i.e. waiting for another thread to do something, so your code is still potentially lock-free in the progress-guarantee sense.
Rolling your own with an atomic<bool> is probably fine; using std::mutex here gains you nothing, because you only want the try-lock and unlock behaviour and nothing more. That's certainly possible with a mutex (with some extra function-call overhead), but some implementations might do something extra, and you're not using any of the functionality beyond that. The one nice thing std::mutex gives you is the comfort of knowing that it safely and correctly implements try_lock and unlock. But if you understand locking and acquire / release, it's easy to get that right yourself.
The usual performance reason not to roll your own locking is that a mutex will be tuned for the OS and typical hardware, with things like exponential backoff, x86 pause instructions while spinning a few times, then a fallback to a system call, and efficient wakeup via system calls like Linux futex. All of that only benefits the blocking behaviour; try_lock leaves it all unused, and if you never have any thread sleeping, then unlock never has any other threads to notify.
There is one advantage to using std::mutex: you can use RAII without having to roll your own wrapper class. std::unique_lock with the std::try_to_lock policy will do this. This will make your function exception-safe, making sure to always unlock before exiting, if it got the lock.
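If you do roll your own, the wrapper is short; a sketch (the class name is made up for illustration):

#include <atomic>

class TryLockGuard {
    std::atomic<bool>& flag_;
    bool owned_;
public:
    explicit TryLockGuard(std::atomic<bool>& f)
        : flag_(f), owned_(!f.exchange(true, std::memory_order_acquire)) {}
    ~TryLockGuard()
    {
        if (owned_)
            flag_.store(false, std::memory_order_release); // unlock on scope exit
    }
    bool owns_lock() const { return owned_; }
};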
I'm currently working on a Spinlock class and trying to make it as reasonable as possible, mostly based on advice here: https://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures
The work-in-progress looks like this:
#include <atomic>
#include <immintrin.h> // _mm_pause()

class Spinlock
{
public:
    Spinlock() : m_lock(false) {}

    void lock()
    {
        // heavy test here for the usual uncontended case
        bool exp = false;
        if (!m_lock.compare_exchange_weak(exp, true, std::memory_order_acquire))
        {
            int failCount = 0;
            for (;;)
            {
                // processor spin loop hint
                _mm_pause();
                // spin on mov instead of lock instruction
                if (!m_lock.load(std::memory_order_relaxed))
                {
                    // heavy test now that we suspect success
                    exp = false;
                    if (m_lock.compare_exchange_weak(exp, true, std::memory_order_acquire))
                    {
                        return;
                    }
                }
                // backoff (potentially make exponential later)
                if (++failCount == SOME_VALUE)
                {
                    // Yield somehow.
                    failCount = 0;
                }
            }
        }
    }

    void unlock()
    {
        m_lock.store(false, std::memory_order_release);
    }

private:
    std::atomic_bool m_lock;
};
However, it seems like having that relaxed read in there can theoretically allow generated code to do unexpected things like create deadlocks: http://joeduffyblog.com/2009/02/23/the-magical-dueling-deadlocking-spin-locks/
This code shouldn't deadlock in the same way as the linked example because the outer acquire should keep the relaxed load from drifting behind, but I don't really have a handle on all the code transformations that could exist. What memory orders and/or fences do I need to keep this code safe without losing performance? Is it possible for a backoff implementation to occur significantly more or less frequently (> a few loops) than intended because the surrounding memory orders are too relaxed?
On a related note, why are spinlock examples around the web using acquire/release memory order for spinlocks instead of sequentially consistent? I found a comment saying that allowing a spinlock release to cross a later spinlock acquire could lead to problems: http://preshing.com/20120913/acquire-and-release-semantics/#IDComment721195810
This code shouldn't deadlock in the same way as the linked example because the outer acquire should keep the relaxed load from drifting behind, but I don't really have a handle on all the code transformations that could exist.
An acquire operation guarantees that subsequent memory operations will not be reordered before the acquire by either the compiler or the CPU.
What memory orders and/or fences do I need to keep this code safe without losing performance?
You do not need any extra synchronization here; your code does the right thing.
why are spinlock examples around the web using acquire/release memory order for spinlocks instead of sequentially consistent?
Because acquire/release semantics are enough to implement a mutex. On some architectures, sequentially consistent operations are more expensive than acquire/release ones.
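For example, on x86 a release store compiles to a plain store, while a seq_cst store typically costs an extra full barrier (an mfence, or an xchg). Illustratively:

m_lock.store(false, std::memory_order_release); // plain mov on x86
m_lock.store(false, std::memory_order_seq_cst); // mov + mfence (or xchg)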
I cannot recommend enough watching atomic<> Weapons: The C++ Memory Model and Modern Hardware; it covers this subject in great detail.
We have found that we have several spots in our code where concurrent reads of data protected by a mutex are rather common, while writes are rare. Our measurements seem to say that using a simple mutex seriously hinders the performance of the code reading that data. So what we would need is a multiple-read/single-write mutex. I know that this can be built atop of simpler primitives, but before I try myself at this, I'd rather ask for existing knowledge:
What is an approved way to build a multiple-read/single-write lock out of simpler synchronization primitives?
I do have an idea how to make it, but I'd rather have answers unbiased by what I (probably wrongly) came up with. (Note: What I expect is an explanation how to do it, probably in pseudo code, not a full-fledged implementation. I can certainly write the code myself.)
Caveats:
This needs to have reasonable performance. (What I have in mind would require two lock/unlock operations per access. Now that might not be good enough, but needing many of them instead seems unreasonable.)
Commonly, reads are more numerous, but writes are more important and performance-sensitive than reads. Readers must not starve writers.
We are stuck on a rather old embedded platform (proprietary variant of VxWorks 5.5), with a rather old compiler (GCC 4.1.2), and boost 1.52 – except for most of boost's parts relying on POSIX, because POSIX isn't fully implemented on that platform. The locking primitives available basically are several kind of semaphores (binary, counting etc.), on top of which we have already created mutexes, conditions variables, and monitors.
This is IA32, single-core.
At first glance I thought I recognized this answer as the same algorithm that Alexander Terekhov introduced. But after studying it I believe that it is flawed. It is possible for two writers to simultaneously wait on m_exclusive_cond. When one of those writers wakes and obtains the exclusive lock, it will set exclusive_waiting_blocked = false on unlock, leaving the mutex in an inconsistent state. After that, the mutex is likely hosed.
N2406, which first proposed std::shared_mutex, contains a partial implementation, which is repeated below with updated syntax.
class shared_mutex
{
    mutex               mut_;
    condition_variable  gate1_;
    condition_variable  gate2_;
    unsigned            state_;

    static const unsigned write_entered_ = 1U << (sizeof(unsigned)*CHAR_BIT - 1);
    static const unsigned n_readers_ = ~write_entered_;

public:
    shared_mutex() : state_(0) {}

    // Exclusive ownership
    void lock();
    bool try_lock();
    void unlock();

    // Shared ownership
    void lock_shared();
    bool try_lock_shared();
    void unlock_shared();
};

// Exclusive ownership

void
shared_mutex::lock()
{
    unique_lock<mutex> lk(mut_);
    while (state_ & write_entered_)
        gate1_.wait(lk);
    state_ |= write_entered_;
    while (state_ & n_readers_)
        gate2_.wait(lk);
}

bool
shared_mutex::try_lock()
{
    unique_lock<mutex> lk(mut_, try_to_lock);
    if (lk.owns_lock() && state_ == 0)
    {
        state_ = write_entered_;
        return true;
    }
    return false;
}

void
shared_mutex::unlock()
{
    {
        lock_guard<mutex> _(mut_);
        state_ = 0;
    }
    gate1_.notify_all();
}

// Shared ownership

void
shared_mutex::lock_shared()
{
    unique_lock<mutex> lk(mut_);
    while ((state_ & write_entered_) || (state_ & n_readers_) == n_readers_)
        gate1_.wait(lk);
    unsigned num_readers = (state_ & n_readers_) + 1;
    state_ &= ~n_readers_;
    state_ |= num_readers;
}

bool
shared_mutex::try_lock_shared()
{
    unique_lock<mutex> lk(mut_, try_to_lock);
    unsigned num_readers = state_ & n_readers_;
    if (lk.owns_lock() && !(state_ & write_entered_) && num_readers != n_readers_)
    {
        ++num_readers;
        state_ &= ~n_readers_;
        state_ |= num_readers;
        return true;
    }
    return false;
}

void
shared_mutex::unlock_shared()
{
    lock_guard<mutex> _(mut_);
    unsigned num_readers = (state_ & n_readers_) - 1;
    state_ &= ~n_readers_;
    state_ |= num_readers;
    if (state_ & write_entered_)
    {
        if (num_readers == 0)
            gate2_.notify_one();
    }
    else
    {
        if (num_readers == n_readers_ - 1)
            gate1_.notify_one();
    }
}
The algorithm is derived from an old newsgroup posting of Alexander Terekhov. It starves neither readers nor writers.
There are two "gates", gate1_ and gate2_. Readers and writers have to pass gate1_, and can get blocked in trying to do so. Once a reader gets past gate1_, it has read-locked the mutex. Readers can get past gate1_ as long as the maximum number of readers does not already have ownership, and as long as a writer has not gotten past gate1_.
Only one writer at a time can get past gate1_. And a writer can get past gate1_ even if readers have ownership. But once past gate1_, a writer still does not have ownership. It must first get past gate2_. A writer can not get past gate2_ until all readers with ownership have relinquished it. Recall that new readers can't get past gate1_ while a writer is waiting at gate2_. And neither can a new writer get past gate1_ while a writer is waiting at gate2_.
The characteristic that both readers and writers are blocked at gate1_ with (nearly) identical requirements imposed to get past it is what makes this algorithm fair to both readers and writers, starving neither.
The mutex "state" is intentionally kept in a single word so as to suggest that the partial use of atomics (as an optimization) for certain state changes is a possibility (i.e. for an uncontended "fast path"). However that optimization is not demonstrated here. One example would be if a writer thread could atomically change state_ from 0 to write_entered_; then it would obtain the lock without having to block or even lock/unlock mut_. And unlock() could be implemented with an atomic store. Etc. These optimizations are not shown here because they are much harder to implement correctly than this simple description makes it sound.
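Purely for illustration, a hedged sketch of that writer fast path, assuming state_ were changed to a std::atomic<unsigned> (the helper name is hypothetical and not part of N2406):

bool
shared_mutex::try_lock_fast() // hypothetical helper
{
    unsigned expected = 0;
    // 0 -> write_entered_: an uncontended writer takes the lock
    // without ever locking mut_
    return state_.compare_exchange_strong(expected, write_entered_,
                                          std::memory_order_acquire);
}

The hard part, as the paragraph above says, is making fast paths like this cooperate correctly with threads that did block on the condition variables.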
It seems like you only have mutex and condition_variable as synchronization primitives. Therefore, I write a reader-writer lock here, which starves readers. It uses one mutex, two condition_variables and three integers.
active_readers - the number of readers currently reading
waiting_writers - writers waiting in the cv writerQ plus the writer currently writing
active_writers - the writer currently writing; can only be 1 or 0
It starves readers in this way: if several writers want to write, readers will never get the chance to read until all writers finish writing, because later readers have to check the waiting_writers variable. At the same time, the active_writers variable guarantees that only one writer can write at a time.
#include <mutex>
#include <condition_variable>

class RWLock {
public:
    RWLock()
        : shared()
        , readerQ(), writerQ()
        , active_readers(0), waiting_writers(0), active_writers(0)
    {}

    void ReadLock() {
        std::unique_lock<std::mutex> lk(shared);
        while( waiting_writers != 0 )
            readerQ.wait(lk);
        ++active_readers;
        lk.unlock();
    }

    void ReadUnlock() {
        std::unique_lock<std::mutex> lk(shared);
        --active_readers;
        lk.unlock();
        writerQ.notify_one();
    }

    void WriteLock() {
        std::unique_lock<std::mutex> lk(shared);
        ++waiting_writers;
        while( active_readers != 0 || active_writers != 0 )
            writerQ.wait(lk);
        ++active_writers;
        lk.unlock();
    }

    void WriteUnlock() {
        std::unique_lock<std::mutex> lk(shared);
        --waiting_writers;
        --active_writers;
        if(waiting_writers > 0)
            writerQ.notify_one();
        else
            readerQ.notify_all();
        lk.unlock();
    }

private:
    std::mutex              shared;
    std::condition_variable readerQ;
    std::condition_variable writerQ;
    int                     active_readers;
    int                     waiting_writers;
    int                     active_writers;
};
Concurrent reads of data protected by a mutex are rather common, while writes are rare
That sounds like an ideal scenario for User-space RCU:
URCU is similar to its Linux-kernel counterpart, providing a replacement for reader-writer locking, among other uses. This similarity continues with readers not synchronizing directly with RCU updaters, thus making RCU read-side code paths exceedingly fast, while furthermore permitting RCU readers to make useful forward progress even when running concurrently with RCU updaters—and vice versa.
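A rough sketch of the classic liburcu read/update pattern (hedged: the exact API differs between URCU flavors, so treat these names as illustrative):

#include <urcu.h> // classic liburcu; link with -lurcu

struct Config { int value; };
Config* shared_cfg; // only accessed through the RCU primitives

void reader()
{
    rcu_register_thread();            // once per reader thread
    rcu_read_lock();                  // extremely cheap read-side section
    Config* c = rcu_dereference(shared_cfg);
    // ... read *c ...
    rcu_read_unlock();
    rcu_unregister_thread();
}

void update(Config* new_cfg)
{
    Config* old = shared_cfg;
    rcu_assign_pointer(shared_cfg, new_cfg); // publish the new version
    synchronize_rcu();                       // wait for pre-existing readers
    delete old;                              // now safe to reclaim
}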
There are some good tricks you can do to help.
First, good performance. VxWorks is notable for its very good context switch times. Whatever locking solution you use, it will likely involve semaphores. I wouldn't be afraid of using semaphores (plural) for this; they're pretty well optimised in VxWorks, and the fast context switch times help minimise the degradation in performance from assessing many semaphore states, etc.
Also, I would forget about using POSIX semaphores, which are simply going to be layered on top of VxWorks' own semaphores. VxWorks provides native counting, binary and mutex semaphores; using the one that suits makes it all a bit faster. The binary ones can be quite useful sometimes; posted to many times, they never exceed the value of 1.
Second, writes being more important than reads. When I've had this kind of requirement in VxWorks and have been using a semaphore(s) to control access, I've used task priority to indicate which task is more important and should get first access to the resource. This works quite well; literally everything in VxWorks is a task (well, thread) like any other, including all the device drivers, etc.
VxWorks also resolves priority inversions (the kind of thing that Linus Torvalds hates). So if you implement your locking with a semaphore(s), you can rely on the OS scheduler to chivvy up lower priority readers if they're blocking a higher priority writer. It can lead to much simpler code, and you're getting the most of the OS too.
So a potential solution is to have a single VxWorks counting semaphore protecting the resource, initialised to a value equal to the number of readers. Each time a reader wants to read, it takes the semaphore (reducing the count by 1). Each time a read is done it posts the semaphore, increasing the count by 1. Each time the writer wants to write it takes the semaphore n (n = number of readers) times, and posts it n times when done. Finally, make the writer task of higher priority than any of the readers, and rely on the OS's fast context switch time and priority-inversion handling.
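A hedged sketch of that scheme using the native VxWorks semaphore API (N_READERS is illustrative):

#include <semLib.h>

const int N_READERS = 4; // number of reader tasks
SEM_ID rwSem = semCCreate(SEM_Q_PRIORITY, N_READERS); // counting semaphore

void reader()
{
    semTake(rwSem, WAIT_FOREVER); // count - 1
    // ... read ...
    semGive(rwSem);               // count + 1
}

void writer()
{
    for (int i = 0; i < N_READERS; ++i) // drain every reader slot
        semTake(rwSem, WAIT_FOREVER);
    // ... write, with exclusive access ...
    for (int i = 0; i < N_READERS; ++i)
        semGive(rwSem);
}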
Remember that you're programming on a hard-realtime OS, not Linux. Taking / posting a native VxWorks semaphore doesn't involve the same amount of runtime as a similar act on Linux, though even Linux is pretty good these days (I'm using PREEMPT_RT nowadays). The VxWorks scheduler and all the device drivers can be relied upon to behave. You can even make your writer task the highest priority in the whole system if you wish, higher even than all the device drivers!
To help things along, also consider what it is that each of your threads is doing. VxWorks allows you to indicate that a task is/isn't using the FPU. If you're using the native VxWorks taskSpawn routine instead of pthread_create, then you get an opportunity to specify this. What it means is that if your thread/task isn't doing any floating point maths, and you've said as much in your call to taskSpawn, the context switch times will be even faster, because the scheduler won't bother to preserve/restore the FPU state.
This stands a reasonable chance of being the best solution on the platform you're developing on. It's playing to the OS's strengths (fast semaphores, fast context switch times) without introducing a load of extra code to recreate an alternate (and possibly more elegant) solution commonly found on other platforms.
Third, being stuck with old GCC and old Boost. Basically I can't help there, other than low-value suggestions about phoning up WindRiver and discussing buying an upgrade. Personally speaking, when I've been programming for VxWorks I've used VxWorks' native API rather than POSIX. OK, so the code hasn't been very portable, but it has ended up being fast; POSIX is merely a layer on top of the native API anyway, and that will always slow things down.
That said, POSIX counting and mutex semaphores are very similar to VxWork's native counting and mutex semaphores. That probably means that the POSIX layering isn't very thick.
General Notes About Programming for VxWorks
Debugging: I always sought to use the development tools (Tornado) available for Solaris. This is by far the best multi-threaded debugging environment I've ever come across. Why? It allows you to start up multiple debug sessions, one for each thread/task in the system. You end up with a debug window per thread, and you are individually and independently debugging each one. Step over a blocking operation, and that debug window gets blocked. Move mouse focus to another debugging window, step over the operation that will release the block, and watch the first window complete its step.
You end up with a lot of debug windows, but it's by far the best way to debug multi-threaded stuff. It made it veeeeery easy to write really quite complex stuff and see problems. You can easily explore the different dynamic interactions in your application because you have simple and all-powerful control over what each thread is doing at any time.
Ironically, the Windows version of Tornado didn't let you do this; one miserable debug window per system, just like any other boring old IDE such as Visual Studio, etc. I've never seen even modern IDEs come anywhere close to being as good as Tornado on Solaris for multi-threaded debugging.
Hard drives: If your readers and writers are using files on disk, consider that VxWorks 5.5 is pretty old. Things like NCQ aren't going to be supported. In this case my proposed solution (outlined above) might be better done with a single mutex semaphore to stop multiple readers tripping over each other in their struggle to read different parts of the disk. It depends on what exactly your readers are doing, but if they're reading contiguous data from a file this would avoid thrashing the read/write head to and fro across the disk surface (very slow).
In my case I was using this trick to shape traffic across a network interface; each task was sending a different sort of data, and the task priority reflected the priority of the data on the network. It was very elegant, no message was ever fragmented, but the important messages got the lions share of the available bandwidth.
As always the best solution will depend on details. A read-write spin lock may be what you're looking for, but other approaches such as read-copy-update as suggested above might be a solution - though on an old embedded platform the extra memory used might be an issue. With rare writes I often arrange the work using a tasking system such that the writes can only occur when there are no reads from that data structure, but this is algorithm dependent.
One algorithm for this based on semaphores and mutexes is described in Concurrent Control with Readers and Writers; P.J. Courtois, F. Heymans, and D.L. Parnas; MBLE Research Laboratory; Brussels, Belgium.
This is a simplified answer based on my Boost headers (I would call Boost an approved way). It only requires Condition Variables and Mutexes. I rewrote it using Windows primitives because I find them descriptive and very simple, but view this as Pseudocode.
This is a very simple solution, which does not support things like mutex upgrading, or try_lock() operations. I can add those if you want. I also took out some frills like disabling interrupts that aren't strictly necessary.
Also, it's worth checking out boost\thread\pthread\shared_mutex.hpp (this being based on that). It's human-readable.
#include <windows.h>

class SharedMutex {
    CRITICAL_SECTION m_state_mutex;
    CONDITION_VARIABLE m_shared_cond;
    CONDITION_VARIABLE m_exclusive_cond;

    size_t shared_count;
    bool exclusive;
    // This causes write blocks to prevent further read blocks
    bool exclusive_waiting_blocked;

public:
    SharedMutex() : shared_count(0), exclusive(false), exclusive_waiting_blocked(false)
    {
        InitializeConditionVariable (&m_shared_cond);
        InitializeConditionVariable (&m_exclusive_cond);
        InitializeCriticalSection (&m_state_mutex);
    }

    ~SharedMutex()
    {
        DeleteCriticalSection (&m_state_mutex);
        // CONDITION_VARIABLEs require no explicit cleanup
    }

    // Write lock
    void lock(void)
    {
        EnterCriticalSection (&m_state_mutex);
        while (shared_count > 0 || exclusive)
        {
            exclusive_waiting_blocked = true;
            SleepConditionVariableCS (&m_exclusive_cond, &m_state_mutex, INFINITE);
        }
        // This thread now 'owns' the mutex
        exclusive = true;
        LeaveCriticalSection (&m_state_mutex);
    }

    void unlock(void)
    {
        EnterCriticalSection (&m_state_mutex);
        exclusive = false;
        exclusive_waiting_blocked = false;
        LeaveCriticalSection (&m_state_mutex);
        WakeConditionVariable (&m_exclusive_cond);
        WakeAllConditionVariable (&m_shared_cond);
    }

    // Read lock
    void lock_shared(void)
    {
        EnterCriticalSection (&m_state_mutex);
        while (exclusive || exclusive_waiting_blocked)
        {
            SleepConditionVariableCS (&m_shared_cond, &m_state_mutex, INFINITE);
        }
        ++shared_count;
        LeaveCriticalSection (&m_state_mutex);
    }

    void unlock_shared(void)
    {
        EnterCriticalSection (&m_state_mutex);
        --shared_count;
        if (shared_count == 0)
        {
            exclusive_waiting_blocked = false;
            LeaveCriticalSection (&m_state_mutex);
            WakeConditionVariable (&m_exclusive_cond);
            WakeAllConditionVariable (&m_shared_cond);
        }
        else
        {
            LeaveCriticalSection (&m_state_mutex);
        }
    }
};
Behavior
Okay, there is some confusion about the behavior of this algorithm, so here is how it works.
During a Write Lock - Both readers and writers are blocked.
At the end of a Write Lock - Reader threads and one writer thread will race to see which one starts.
During a Read Lock - Writers are blocked. Readers are also blocked if and only if a Writer is blocked.
At the release of the final Read Lock - Reader threads and one writer thread will race to see which one starts.
This could cause readers to starve writers if the processor frequently context switches over to an m_shared_cond thread before an m_exclusive_cond thread during notification, but I suspect that issue is theoretical rather than practical, since this is Boost's algorithm.
Now that Microsoft has opened up the .NET source code, you can look at their ReaderWriterLockSlim implementation.
I'm not sure whether the more basic primitives they use are available to you; some of them are also part of the .NET library, and their code is also available.
Microsoft has spent quite a lot of time improving the performance of their locking mechanisms, so this can be a good starting point.
In my code I have the following structure:
Parent thread
somedatatype thread1_continue, thread2_continue; // Does bool guarantee no data race?
Thread 1:
while (thread1_continue) {
    // Do some work
}
Thread 2:
while (thread2_continue) {
    // Do some work
}
So I wonder which data type thread1_continue and thread2_continue should be to avoid data races, and also whether there is any data type or technique in pthreads to solve this problem.
There is no built-in basic type that guarantees thread safety, no matter how small. Even if you are working with bool or unsigned char, neither reading nor writing is guaranteed to be atomic. In other words: there is a chance that if more threads are independently working with the same memory, one thread can overwrite the memory only partially while another reads the garbage value; in that case the behavior is undefined.
You could use a mutex to wrap the critical section with lock and unlock calls to ensure mutual exclusion; only one thread at a time will be able to execute that code. For more sophisticated synchronization there are semaphores, condition variables, and even patterns/idioms describing how synchronization can be handled using these (light switch, turnstile, etc.). Just study more about these; some simple examples can be found here :)
Note that there might be some more complex types/wrappers available that wrap the way the object is being accessed, such as the std::atomic template in C++11, which internally handles the synchronization for you so that you don't need to do it explicitly. With std::atomic there is a guarantee that: "if one thread writes to an atomic object while another thread reads from it, the behavior is well-defined".
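Applied to the flags above, a minimal C++11 sketch:

#include <atomic>

std::atomic<bool> thread1_continue{true};
std::atomic<bool> thread2_continue{true};

// Thread 1:
while (thread1_continue.load()) {
    // Do some work
}

// Parent, to stop thread 1:
thread1_continue.store(false);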
For booleans (and others), be sure to avoid
thread 1 loop
{
    do actions1;
    myFlag = true;
    do more1;
}
thread 2 loop
{
    do actions2;
    if (myFlag)
    {
        myFlag = false;
        do flagged actions;
    }
    do more2;
}
This nearly always works, until myFlag is set by thread 1 while thread 2 is between checking and resetting myFlag. There are CPU-dependent primitives to handle test-and-set, but the normal solution is to lock when accessing shared resources, even booleans.
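For instance, one way to make that hand-off safe with C++11 atomics (a sketch; the answer above only names test-and-set primitives generically):

#include <atomic>

std::atomic<bool> myFlag{false};

// thread 1:
myFlag.store(true, std::memory_order_release);

// thread 2:
if (myFlag.exchange(false, std::memory_order_acquire))
{
    // do flagged actions; the check and the reset were one atomic step
}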
I have a parent and a worker thread that share a bool flag and a std::vector. The parent only reads (i.e., reads the bool or calls my_vector.empty()); the worker only writes.
My questions:
Do I need to mutex protect the bool flag?
Can I say that all bool read/writes are inherently atomic operations? If you say Yes or No, where did you get your information from?
I recently heard about GCC Atomic-builtin. Can I use these to make my flag read/writes atomic without having to use mutexes? What is the difference? I understand Atomic builtins boil down to machine code, but even mutexes boil down to CPU's memory barrier instructions right? Why do people call mutexes an "OS-level" construct?
Do I need to mutex protect my std::vector? Recall that the worker thread populates this vector, whereas the parent only calls empty() on it (i.e., only reads it)
I do not believe mutex protection is necessary for either the bool or the vector. I rationalize as follows: "OK, if I read the shared memory just before it was updated... that's still fine, I will get the updated value the next time around. More importantly, I do not see why the writer should be blocked while the reader is reading, because after all, the reader is only reading!"
If someone can point me in the right direction, that would be just great. I am on GCC 4.3, and Intel x86 32-bit.
Thanks a lot!
Do I need to mutex protect the bool flag?
Not necessarily, an atomic instruction would do. By atomic instruction I mean a compiler intrinsic function that a) prevents compiler reordering/optimization and b) results in atomic read/write and c) issues an appropriate memory fence to ensure visibility between CPUs (not necessary for current x86 CPUs which employ MESI cache coherency protocol). Similar to gcc atomic builtins.
Can I say that all bool read/writes are inherently atomic operations? If you say Yes or No, where did you get your information from?
Depends on the CPU. For Intel CPUs - yes. See Intel® 64 and IA-32 Architectures Software Developer's Manuals.
I recently heard about GCC Atomic-builtin. Can I use these to make my flag read/writes atomic without having to use mutexes? What is the difference? I understand Atomic builtins boil down to machine code, but even mutexes boil down to CPU's memory barrier instructions right? Why do people call mutexes an "OS-level" construct?
The difference between atomics and mutexes is that the latter can put the waiting thread to sleep until the mutex is released. With atomics you can only busy-spin.
Do I need to mutex protect my std::vector? Recall that the worker thread populates this vector, whereas the parent only calls empty() on it (i.e., only reads it)
You do.
I do not believe mutex protection is necessary for either the bool or the vector. I rationalize as follows: "OK, if I read the shared memory just before it was updated... that's still fine, I will get the updated value the next time around. More importantly, I do not see why the writer should be blocked while the reader is reading, because after all, the reader is only reading!"
Depending on the implementation, vector.empty() may involve reading two buffer begin/end pointers and subtracting or comparing them, hence there is a chance that you read a new version of one pointer and an old version of another one without a mutex. Surprising behaviour may ensue.
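To make that concrete, a toy illustration (the member names are made up; real layouts vary by implementation):

// Illustrative only: a typical vector keeps begin/end pointers,
// and empty() compares them -- two separate loads, not one.
template <class T>
struct toy_vector {
    T* begin_;
    T* end_;
    bool empty() const { return begin_ == end_; }
};
// Without a mutex, a reader can observe begin_ from a new allocation and
// end_ from the old one mid-update, so empty() may answer nonsense.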
From the C++11 standard's point of view, you have to protect the bool with a mutex, or alternatively use std::atomic<bool>. Even when you are sure that your bool is read and written to atomically anyway, there is still the chance that the compiler can optimize away accesses to it, because it does not know about other threads that could potentially access it.
If for some reason you absolutely need the last bit of performance out of your platform, consider reading the "Intel 64 and IA-32 Architectures Software Developer's Manual", which will tell you how things work under the hood on your architecture. But of course, this will make your program unportable.
Answers:
You will need to protect the bool (or any other variable for that matter) that has the possibility of being operated on by two or more threads at the same time. You can either do this with a mutex or by operating on the bool atomically.
Bool reads and bool writes may be atomic operations, but two sequential operations are certainly not (e.g., a read and then a write). More on this later.
Atomic builtins provide a solution to the problem above: the ability to read and write a variable in a step that cannot be interrupted by another thread. This makes the operation atomic.
If you are using the bool flag as your 'mutex' (that is, only the thread that sets the bool flag to true has permission to modify the vector) then you're OK. The mutual exclusion is managed by the boolean, and as long as you're modifying the bool using atomic operations you should be all set.
To answer this, let me use an example:
bool flag(false);
std::vector<char> my_vector;

while (true)
{
    if (flag == false) // check to see if the mutex is owned
    {
        flag = true; // obtain ownership of the flag (the mutex)
        // manipulate the vector
        flag = false; // release ownership of the flag
    }
}
In the above code, in a multithreaded environment it is possible for the thread to be preempted between the if statement (the read) and the assignment (the write), which means it is possible for two (or more) threads with this kind of code to both "own" the mutex (and the rights to the vector) at the same time. This is why atomic operations are crucial: they ensure that in the above scenario the flag will only be set by one thread at a time, therefore ensuring the vector will only be manipulated by one thread at a time.
Note that setting the flag back to false need not be an atomic operation, because this thread is the only one with rights to modify it.
A rough (read: untested) solution may look something like:
bool flag(false);
std::vector<char> my_vector;

while (true)
{
    // check to see if the mutex is owned and obtain ownership if possible
    if (__sync_bool_compare_and_swap(&flag, false, true))
    {
        // manipulate the vector
        flag = false; // release ownership of the flag
    }
}
The documentation for the atomic builtin reads:
The “bool” version returns true if the comparison is successful and newval was written.
Which means the operation will check to see if flag is false and, if it is, set the value to true. If the value was false, true is returned; otherwise, false. All of this happens in one atomic step, so it is guaranteed not to be preempted by another thread.
I don't have the expertise to answer your entire question but your last bullet is incorrect in cases in which reads are non-atomic by default.
A context switch can happen anywhere, the reader can get context switched partway through a read, the writer can get switched in and do the full write, and then the reader would finish their read. The reader would see neither the first value, nor the second value, but potentially some wildly inaccurate intermediate value.