What lock-free primitives do people actually use for lock-free audio processing in C++?

Anyone who has done a bit of low level audio programming has been cautioned about locking the audio thread. Here's a nice article on the subject.
But it's very unclear to me how to actually go about building a multithreaded audio processing application in C++ while strictly following this rule and ensuring thread safety. Assume you're building something simple like a visualizer: you need to hand off audio data to a UI thread that processes and displays it on a regular interval.
My first attempt at this would be to ping-pong between two buffers. buffer*_write_state is a boolean type assumed to be atomic and lock-free. buffer* is some kind of buffer with no expectation of being thread safe on its own, and with some means of handling the case where one thread gets called at an insufficient rate (I don't mean to get into the complications of that here). For a generic-looking boolean type, the implementation looks like this:
// Write thread.
if (buffer1_write_state) {
    buffer1.write(data);
    if (buffer2_write_state) {
        buffer1_write_state = false;
    }
} else {
    buffer2.write(data);
    if (buffer1_write_state) {
        buffer2_write_state = false;
    }
}

// Read thread.
if (buffer1_write_state) {
    data = buffer2.read();
    buffer2.clear();
    buffer2_write_state = true;
} else if (buffer2_write_state) {
    data = buffer1.read();
    buffer1.clear();
    buffer1_write_state = true;
}
I've implemented this using std::atomic_flag as my boolean type, and as far as I can tell with my thread sanitizer, it is thread safe. std::atomic_flag is guaranteed to be lock-free by the standard. The point that confuses me is that to even do this, I need std::atomic_flag's test() function, which doesn't exist prior to C++20. The available, mutating test_and_set() and clear() functions don't do the job. The well-known alternative, std::atomic, is not guaranteed to be lock-free by the standard. I've heard that in most cases it isn't.
I've read a few threads that caution people against rolling their own lock-free structures, and I'm happy to abide by that tip, but how do experts even build these things if the basic tools aren't guaranteed to be lock-free?

I've heard that in most cases it isn't.
You heard wrong.
std::atomic<bool> is lock-free on all "normal" C++ implementations, e.g. for ARM, x86, PowerPC, etc. Use it if atomic_flag's restrictive API sucks too much. Or std::atomic<int>, also pretty universally lock-free on targets that have lock-free anything.
(The only plausible exception would be an 8-bit machine that can't load/store/RMW a pair of bytes.)
Note that if you're targeting ARM, you should enable compiler options to let the compiler know you don't care about ARM CPUs too old to support atomic operations. Otherwise, it has to emit slow code that uses library function calls in case the binary runs on ARMv4 or something. See std::atomic<bool> lock-free inconsistency on ARM (raspberry pi 3)
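For reference, here is a minimal sketch of what the flag handling could look like with std::atomic<bool> (names follow the question; the acquire/release orderings are my assumption about what the buffer hand-off needs). std::atomic<bool> has had a non-mutating read (load()) since C++11, so you don't need C++20's atomic_flag::test().
#include <atomic>
#include <cassert>

std::atomic<bool> buffer1_write_state{true};

void write_thread_iteration() {
    // You can check the lock-freedom claim at runtime, or at compile time
    // via the ATOMIC_BOOL_LOCK_FREE macro (2 means always lock-free).
    assert(buffer1_write_state.is_lock_free());
    if (buffer1_write_state.load(std::memory_order_acquire)) {
        // ... write into buffer1 ...
        buffer1_write_state.store(false, std::memory_order_release);
    }
}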

Related

Efficient way to have a thread wait for a value to change in memory?

For some silly reason, there's a piece of hardware on my (GNU/Linux) machine that can only communicate a certain occurrence by writing a value to memory. Assume that by some magic, the area of memory the hardware writes to is visible to a process I'm running. Now, I want to have a thread within that process keep track of that value, and as soon as possible after it has changed - execute some code. However, it is more important to me that the thread not waste CPU time than for it to absolutely minimize the response delay. So - no busy-waiting on a volatile...
How should I best do this (using modern C++)?
Notes:
I don't mind a solution involving atomics, or synchronization mechanisms (in fact, that would perhaps be preferable) - as long as you bear in mind that the hardware doesn't support atomic operations on host memory - it performs a plain write.
The value the hardware writes can be whatever I like, as can the initial value in the memory location it writes to.
I used C++11 since it's the popular tag for Modern C++, but really, C++14 is great and C++17 is ok. On the other hand, even a C-based solution will do.
So, the naive thing to do would be non-busy sleeping, e.g.:
volatile int32_t* special_location = get_special_location();
auto polling_interval_in_usec = perform_tradeoff_between_accuracy_and_cpu_load();
auto polling_interval = std::chrono::microseconds(polling_interval_in_usec);

while (should_continue_polling()) {
    if (*special_location == HardwareIsDone) {
        do_stuff();
        return;
    }
    std::this_thread::sleep_for(polling_interval);
}
This is usually done via std::condition_variable.
... as long as you bear in mind that the hardware doesn't support atomic operations on host memory - it performs a plain write.
Implementations of std::atomic may fall back to mutexes in such cases.
UPD - Possible implementation details: assuming you have some data structure in a form of:
struct MyData {
    std::mutex mutex;
    std::condition_variable cv;
    some_user_type value;
};
and you have access to it from several processes. The writer process overwrites value and notifies cv via notify_one; the reader process waits on cv in a manner somewhat similar to a busy wait, but the thread yields for the wait duration. Everything else I could add is already present in the referred examples.
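A minimal sketch of that scheme, assuming threads within a single process (an int stands in for some_user_type, and the "value changed" test is a placeholder predicate):
#include <condition_variable>
#include <mutex>

struct MyData {
    std::mutex mutex;
    std::condition_variable cv;
    int value = 0; // stand-in for some_user_type
};

MyData d;

void writer(int v) {
    {
        std::lock_guard<std::mutex> lk(d.mutex);
        d.value = v;       // overwrite the value under the lock
    }
    d.cv.notify_one();     // wake the waiting reader
}

void reader() {
    std::unique_lock<std::mutex> lk(d.mutex);
    d.cv.wait(lk, [] { return d.value != 0; }); // predicate guards against spurious wakeups
    // ... execute some code as soon as possible after the change ...
}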

How to make a multiple-read/single-write lock from more basic synchronization primitives?

We have found several spots in our code where concurrent reads of data protected by a mutex are rather common, while writes are rare. Our measurements seem to say that using a simple mutex seriously hinders the performance of the code reading that data. So what we need is a multiple-read/single-write mutex. I know that this can be built atop simpler primitives, but before I try my hand at this, I'd rather ask for existing knowledge:
What is an approved way to build a multiple-read/single-write lock out of simpler synchronization primitives?
I do have an idea how to make it, but I'd rather have answers unbiased by what I (probably wrongly) came up with. (Note: What I expect is an explanation how to do it, probably in pseudo code, not a full-fledged implementation. I can certainly write the code myself.)
Caveats:
This needs to have reasonable performance. (What I have in mind would require two lock/unlock operations per access. Now that might not be good enough, but needing many of them instead seems unreasonable.)
Commonly, reads are more numerous, but writes are more important and performance-sensitive than reads. Readers must not starve writers.
We are stuck on a rather old embedded platform (a proprietary variant of VxWorks 5.5), with a rather old compiler (GCC 4.1.2), and Boost 1.52 – except for most of Boost's parts relying on POSIX, because POSIX isn't fully implemented on that platform. The locking primitives available are basically several kinds of semaphores (binary, counting, etc.), on top of which we have already created mutexes, condition variables, and monitors.
This is IA32, single-core.
At first glance I thought I recognized this answer as the same algorithm that Alexander Terekhov introduced. But after studying it I believe that it is flawed. It is possible for two writers to simultaneously wait on m_exclusive_cond. When one of those writers wakes and obtains the exclusive lock, it will set exclusive_waiting_blocked = false on unlock, thus setting the mutex into an inconsistent state. After that, the mutex is likely hosed.
N2406, which first proposed std::shared_mutex, contains a partial implementation, repeated below with updated syntax.
class shared_mutex
{
    mutex              mut_;
    condition_variable gate1_;
    condition_variable gate2_;
    unsigned           state_;

    static const unsigned write_entered_ = 1U << (sizeof(unsigned)*CHAR_BIT - 1);
    static const unsigned n_readers_ = ~write_entered_;

public:
    shared_mutex() : state_(0) {}

    // Exclusive ownership
    void lock();
    bool try_lock();
    void unlock();

    // Shared ownership
    void lock_shared();
    bool try_lock_shared();
    void unlock_shared();
};

// Exclusive ownership

void
shared_mutex::lock()
{
    unique_lock<mutex> lk(mut_);
    while (state_ & write_entered_)
        gate1_.wait(lk);
    state_ |= write_entered_;
    while (state_ & n_readers_)
        gate2_.wait(lk);
}

bool
shared_mutex::try_lock()
{
    unique_lock<mutex> lk(mut_, try_to_lock);
    if (lk.owns_lock() && state_ == 0)
    {
        state_ = write_entered_;
        return true;
    }
    return false;
}

void
shared_mutex::unlock()
{
    {
        lock_guard<mutex> _(mut_);
        state_ = 0;
    }
    gate1_.notify_all();
}

// Shared ownership

void
shared_mutex::lock_shared()
{
    unique_lock<mutex> lk(mut_);
    while ((state_ & write_entered_) || (state_ & n_readers_) == n_readers_)
        gate1_.wait(lk);
    unsigned num_readers = (state_ & n_readers_) + 1;
    state_ &= ~n_readers_;
    state_ |= num_readers;
}

bool
shared_mutex::try_lock_shared()
{
    unique_lock<mutex> lk(mut_, try_to_lock);
    unsigned num_readers = state_ & n_readers_;
    if (lk.owns_lock() && !(state_ & write_entered_) && num_readers != n_readers_)
    {
        ++num_readers;
        state_ &= ~n_readers_;
        state_ |= num_readers;
        return true;
    }
    return false;
}

void
shared_mutex::unlock_shared()
{
    lock_guard<mutex> _(mut_);
    unsigned num_readers = (state_ & n_readers_) - 1;
    state_ &= ~n_readers_;
    state_ |= num_readers;
    if (state_ & write_entered_)
    {
        if (num_readers == 0)
            gate2_.notify_one();
    }
    else
    {
        if (num_readers == n_readers_ - 1)
            gate1_.notify_one();
    }
}
The algorithm is derived from an old newsgroup posting of Alexander Terekhov. It starves neither readers nor writers.
There are two "gates", gate1_ and gate2_. Readers and writers have to pass gate1_, and can get blocked in trying to do so. Once a reader gets past gate1_, it has read-locked the mutex. Readers can get past gate1_ as long as there are not a maximum number of readers with ownership, and as long as a writer has not gotten past gate1_.
Only one writer at a time can get past gate1_. And a writer can get past gate1_ even if readers have ownership. But once past gate1_, a writer still does not have ownership. It must first get past gate2_. A writer can not get past gate2_ until all readers with ownership have relinquished it. Recall that new readers can't get past gate1_ while a writer is waiting at gate2_. And neither can a new writer get past gate1_ while a writer is waiting at gate2_.
The characteristic that both readers and writers are blocked at gate1_ with (nearly) identical requirements imposed to get past it, is what makes this algorithm fair to both readers and writers, starving neither.
The mutex "state" is intentionally kept in a single word so as to suggest that the partial use of atomics (as an optimization) for certain state changes is a possibility (i.e. for an uncontended "fast path"). However that optimization is not demonstrated here. One example would be if a writer thread could atomically change state_ from 0 to write_entered then he obtains the lock without having to block or even lock/unlock mut_. And unlock() could be implemented with an atomic store. Etc. These optimizations are not shown herein because they are much harder to implement correctly than this simple description makes it sound.
It seems like you only have mutex and condition_variable as synchronization primitives, so I'll write a reader-writer lock here, which starves readers. It uses one mutex, two condition_variables and three integers.
active_readers - the readers currently reading
waiting_writers - writers waiting in the cv writerQ, plus the writing writer
active_writers - the writer currently writing; can only be 1 or 0
It starves readers in this way: if several writers want to write, readers will never get the chance to read until all writers finish writing, because later readers have to check the waiting_writers variable. At the same time, the active_writers variable guarantees that only one writer can write at a time.
class RWLock {
public:
    RWLock()
        : shared()
        , readerQ(), writerQ()
        , active_readers(0), waiting_writers(0), active_writers(0)
    {}

    void ReadLock() {
        std::unique_lock<std::mutex> lk(shared);
        while (waiting_writers != 0)
            readerQ.wait(lk);
        ++active_readers;
        lk.unlock();
    }

    void ReadUnlock() {
        std::unique_lock<std::mutex> lk(shared);
        --active_readers;
        lk.unlock();
        writerQ.notify_one();
    }

    void WriteLock() {
        std::unique_lock<std::mutex> lk(shared);
        ++waiting_writers;
        while (active_readers != 0 || active_writers != 0)
            writerQ.wait(lk);
        ++active_writers;
        lk.unlock();
    }

    void WriteUnlock() {
        std::unique_lock<std::mutex> lk(shared);
        --waiting_writers;
        --active_writers;
        if (waiting_writers > 0)
            writerQ.notify_one();
        else
            readerQ.notify_all();
        lk.unlock();
    }

private:
    std::mutex              shared;
    std::condition_variable readerQ;
    std::condition_variable writerQ;
    int                     active_readers;
    int                     waiting_writers;
    int                     active_writers;
};
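And a short usage sketch (the reader and writer functions here are illustrative):
RWLock rw;
int shared_value = 0;

int read_value() {
    rw.ReadLock();
    int v = shared_value;   // many readers may hold the lock concurrently
    rw.ReadUnlock();
    return v;
}

void write_value(int v) {
    rw.WriteLock();
    shared_value = v;       // exclusive: no readers, no other writer
    rw.WriteUnlock();
}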
Concurrent reads of data protected by a mutex are rather common, while writes are rare
That sounds like an ideal scenario for User-space RCU:
URCU is similar to its Linux-kernel counterpart, providing a replacement for reader-writer locking, among other uses. This similarity continues with readers not synchronizing directly with RCU updaters, thus making RCU read-side code paths exceedingly fast, while furthermore permitting RCU readers to make useful forward progress even when running concurrently with RCU updaters—and vice versa.
There are some good tricks you can use to help.
First, good performance. VxWorks is notable for its very good context switch times. Whatever locking solution you use, it will likely involve semaphores. I wouldn't be afraid of using semaphores (plural) for this; they're pretty well optimised in VxWorks, and the fast context switch times help minimise the degradation in performance from assessing many semaphore states, etc.
Also, I would forget about using POSIX semaphores, which are simply going to be layered on top of VxWorks's own semaphores. VxWorks provides native counting, binary and mutex semaphores; using the one that suits makes it all a bit faster. The binary ones can be quite useful sometimes: posted to many times, they never exceed the value of 1.
Second, writes being more important than reads. When I've had this kind of requirement in VxWorks and have been using a semaphore(s) to control access, I've used task priority to indicate which task is more important and should get first access to the resource. This works quite well; literally everything in VxWorks is a task (well, thread) like any other, including all the device drivers, etc.
VxWorks also resolves priority inversions (the kind of thing that Linus Torvalds hates). So if you implement your locking with a semaphore(s), you can rely on the OS scheduler to chivvy up lower priority readers if they're blocking a higher priority writer. It can lead to much simpler code, and you're getting the most of the OS too.
So a potential solution is to have a single VxWorks counting semaphore protecting the resource, initialised to a value equal to the number of readers. Each time a reader wants to read, it takes the semaphore (reducing the count by 1). Each time a read is done, it posts the semaphore, increasing the count by 1. Each time the writer wants to write, it takes the semaphore n (n = number of readers) times, and posts it n times when done. Finally, make the writer task higher priority than any of the readers, and rely on the OS's fast context switch time and priority inversion handling. A sketch follows.
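Here is a minimal sketch of that scheme, with C++20's std::counting_semaphore standing in for a native VxWorks counting semaphore (semTake/semGive); the fixed reader count and the single writer task are assumptions, and the writer-priority part is left to the OS scheduler as described.
#include <cstddef>
#include <semaphore>

constexpr std::ptrdiff_t N_READERS = 4;              // assumed number of readers
std::counting_semaphore<N_READERS> slots{N_READERS}; // initialised "full"

void reader() {
    slots.acquire();           // take the semaphore (count - 1)
    // ... read the shared resource ...
    slots.release();           // post the semaphore (count + 1)
}

void writer() {                // assumes a single writer task, as in the answer
    for (std::ptrdiff_t i = 0; i < N_READERS; ++i)
        slots.acquire();       // take it n times: waits out every active reader
    // ... write the shared resource ...
    slots.release(N_READERS);  // post it n times when done
}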
Remember that you're programming on a hard-realtime OS, not Linux. Taking / posting a native VxWorks semaphore doesn't involve the same amount of runtime as a similar act on Linux, though even Linux is pretty good these days (I'm using PREEMPT_RT nowadays). The VxWorks scheduler and all the device drivers can be relied upon to behave. You can even make your writer task the highest priority in the whole system if you wish, higher even than all the device drivers!
To help things along, also consider what each of your threads is doing. VxWorks allows you to indicate that a task is/isn't using the FPU. If you're using native VxWorks taskSpawn routines instead of pthread_create, then you get an opportunity to specify this. What it means is that if your thread/task isn't doing any floating point maths, and you've said as much in your call to taskSpawn, the context switch times will be even faster, because the scheduler won't bother to preserve/restore the FPU state.
This stands a reasonable chance of being the best solution on the platform you're developing on. It's playing to the OS's strengths (fast semaphores, fast context switch times) without introducing a load of extra code to recreate an alternate (and possibly more elegant) solution commonly found on other platforms.
Third, stuck with old GCC and old Boost. Basically I can't help there, other than making low-value suggestions about phoning up Wind River and discussing buying an upgrade. Personally speaking, when programming for VxWorks I've used VxWorks's native API rather than POSIX. OK, so the code hasn't been very portable, but it has ended up being fast; POSIX is merely a layer on top of the native API anyway, and that will always slow things down.
That said, POSIX counting and mutex semaphores are very similar to VxWorks's native counting and mutex semaphores. That probably means the POSIX layering isn't very thick.
General Notes About Programming for VxWorks
Debugging I always sought to use the development tools (Tornado) available for Solaris. This is by far the best multi-threaded debugging environment I've ever come across. Why? It allows you to start up multiple debug sessions, one for each thread/task in the system. You end up with a debug window per thread, and you are individually and independently debugging each one. Step over a blocking operation, that debug window gets blocked. Move mouse focus to another debugging window, step over the operation that will release the block and watch the first window complete its step.
You end up with a lot of debug windows, but it's by far the best way to debug multi-threaded stuff. It made it veeeeery easy to write really quite complex stuff and see problems. You can easily explore the different dynamic interactions in your application, because you have simple and all-powerful control over what each thread is doing at any time.
Ironically, the Windows version of Tornado didn't let you do this; one miserable single debug window per system, just like any other boring old IDE such as Visual Studio, etc. I've never seen even modern IDEs come anywhere close to being as good as Tornado on Solaris for multi-threaded debugging.
HardDrives If your readers and writers are using files on disk, consider that VxWorks 5.5 is pretty old. Things like NCQ aren't going to be supported. In this case my proposed solution (outlined above) might be better done with a single mutex semaphore to stop multiple readers tripping over each other in their struggle to read different parts of the disk. It depends on what exactly your readers are doing, but if they're reading contiguous data from a file this would avoid thrashing the read/write head to and fro across the disk surface (very slow).
In my case I was using this trick to shape traffic across a network interface; each task was sending a different sort of data, and the task priority reflected the priority of the data on the network. It was very elegant; no message was ever fragmented, yet the important messages got the lion's share of the available bandwidth.
As always the best solution will depend on details. A read-write spin lock may be what you're looking for, but other approaches such as read-copy-update as suggested above might be a solution - though on an old embedded platform the extra memory used might be an issue. With rare writes I often arrange the work using a tasking system such that the writes can only occur when there are no reads from that data structure, but this is algorithm dependent.
One algorithm for this based on semaphores and mutexes is described in Concurrent Control with Readers and Writers; P.J. Courtois, F. Heymans, and D.L. Parnas; MBLE Research Laboratory; Brussels, Belgium.
This is a simplified answer based on my Boost headers (I would call Boost an approved way). It only requires Condition Variables and Mutexes. I rewrote it using Windows primitives because I find them descriptive and very simple, but view this as Pseudocode.
This is a very simple solution, which does not support things like mutex upgrading, or try_lock() operations. I can add those if you want. I also took out some frills like disabling interrupts that aren't strictly necessary.
Also, it's worth checking out boost\thread\pthread\shared_mutex.hpp (this being based on that). It's human-readable.
class SharedMutex {
    CRITICAL_SECTION m_state_mutex;
    CONDITION_VARIABLE m_shared_cond;
    CONDITION_VARIABLE m_exclusive_cond;

    size_t shared_count;
    bool exclusive;
    // This causes write blocks to prevent further read blocks
    bool exclusive_waiting_blocked;

public:
    SharedMutex()
        : shared_count(0), exclusive(false), exclusive_waiting_blocked(false)
    {
        InitializeConditionVariable (&m_shared_cond);
        InitializeConditionVariable (&m_exclusive_cond);
        InitializeCriticalSection (&m_state_mutex);
    }

    ~SharedMutex()
    {
        // Windows CONDITION_VARIABLEs require no explicit destruction.
        DeleteCriticalSection (&m_state_mutex);
    }

    // Write lock
    void lock(void)
    {
        EnterCriticalSection (&m_state_mutex);
        while (shared_count > 0 || exclusive)
        {
            exclusive_waiting_blocked = true;
            SleepConditionVariableCS (&m_exclusive_cond, &m_state_mutex, INFINITE);
        }
        // This thread now 'owns' the mutex
        exclusive = true;
        LeaveCriticalSection (&m_state_mutex);
    }

    void unlock(void)
    {
        EnterCriticalSection (&m_state_mutex);
        exclusive = false;
        exclusive_waiting_blocked = false;
        LeaveCriticalSection (&m_state_mutex);
        WakeConditionVariable (&m_exclusive_cond);
        WakeAllConditionVariable (&m_shared_cond);
    }

    // Read lock
    void lock_shared(void)
    {
        EnterCriticalSection (&m_state_mutex);
        while (exclusive || exclusive_waiting_blocked)
        {
            SleepConditionVariableCS (&m_shared_cond, &m_state_mutex, INFINITE);
        }
        ++shared_count;
        LeaveCriticalSection (&m_state_mutex);
    }

    void unlock_shared(void)
    {
        EnterCriticalSection (&m_state_mutex);
        --shared_count;
        if (shared_count == 0)
        {
            exclusive_waiting_blocked = false;
            LeaveCriticalSection (&m_state_mutex);
            WakeConditionVariable (&m_exclusive_cond);
            WakeAllConditionVariable (&m_shared_cond);
        }
        else
        {
            LeaveCriticalSection (&m_state_mutex);
        }
    }
};
Behavior
Okay, there is some confusion about the behavior of this algorithm, so here is how it works.
During a Write Lock - Both readers and writers are blocked.
At the end of a Write Lock - Reader threads and one writer thread will race to see which one starts.
During a Read Lock - Writers are blocked. Readers are also blocked if and only if a Writer is blocked.
At the release of the final Read Lock - Reader threads and one writer thread will race to see which one starts.
This could cause readers to starve writers if the processor frequently context switches over to a m_shared_cond thread before an m_exclusive_cond thread during notification, but I suspect that issue is theoretical rather than practical, since this is Boost's algorithm.
Now that Microsoft has opened up the .NET source code, you can look at their ReaderWriterLockSlim implementation.
I'm not sure the more basic primitives they use are available to you; some of them are also part of the .NET library, and their code is also available.
Microsoft has spent quite a lot of time on improving the performance of their locking mechanisms, so this can be a good starting point.

Using volatile boolean variable for busy waiting

The question arises after reading some code written by other developers, so I did some research and found an article by Andrei Alexandrescu. In his article he says that it is possible to use a volatile boolean variable for busy waiting (see the first example with Wait/Wakeup):
class Gadget
{
public:
    void Wait()
    {
        while (!flag_)
        {
            Sleep(1000); // sleeps for 1000 milliseconds
        }
    }
    void Wakeup()
    {
        flag_ = true;
    }
    ...
private:
    bool flag_;
};
I don't really get how it works.
volatile doesn't guarantee that the operations will be atomic. In practice, reads/writes to a boolean variable are atomic, but the standard doesn't guarantee that. From my point of view, the code above could be safely rewritten for C++11 by using std::atomic's load/store functions with acquire/release memory ordering constraints, respectively.
We don't have this problem in the described example, but if we have more than one write we may have problems with memory ordering. volatile is not a fence; it doesn't force memory ordering, it just prevents compiler optimization.
So why do so many people use volatile bool for busy waiting, and is it really portable?
The article doesn't say that volatile is all you need (indeed, it is not), only that it can be useful.
If you do this, and if you use the simple generic component LockingPtr, you can write thread-safe code and worry much less about race conditions, because the compiler will worry for you and will diligently point out the spots where you are wrong.
I don't really get how it works.
It relies on two assumptions:
reads and writes to boolean variables are atomic;
all threads have a uniform view of memory, so that modifications made on one thread will be visible to others within a short amount of time without an explicit memory barrier.
The first is likely to hold on any sane architecture. The second holds on any single-core architecture, and on the multi-core architectures in widespread use today, but there's no guarantee that it will continue to hold in the future.
the code above could be safely rewritten with C++11 by using std::atomic
Today, it can and should be. In 2001, when the article was written, not so much.
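For illustration, a sketch of that rewrite (my adaptation of the Gadget example, with sleep_for replacing the Windows Sleep call):
#include <atomic>
#include <chrono>
#include <thread>

class Gadget
{
public:
    void Wait()
    {
        // acquire: once this load observes true, everything the waking
        // thread wrote before its release store is visible here
        while (!flag_.load(std::memory_order_acquire))
            std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
    void Wakeup()
    {
        flag_.store(true, std::memory_order_release);
    }
private:
    std::atomic<bool> flag_{false};
};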
if we have more than one write we may have problems with memory ordering
Indeed. If this mechanism is used for synchronisation with other data, then we're relying on a third assumption: that modification order is preserved. Again, most popular processors give that behaviour, but there's no guarantee that this will continue.
why do so many people use volatile bool for busy waiting
Because they can't or won't change habits they formed before C++ acquired a multi-threaded memory model.
and is it really portable?
No. The C++11 memory model doesn't guarantee any of these assumptions, and there's a good chance that they will become impractical for future hardware to support, as the typical number of cores grows. volatile was never a solution for thread synchronisation, and that goes doubly now that the language does provide the correct solutions.

Is it ok to read a shared boolean flag without locking it when another thread may set it (at most once)?

I would like my thread to shut down more gracefully, so I am trying to implement a simple signalling mechanism. I don't think I want a fully event-driven thread, so I have a worker with a method to gracefully stop it using a critical section Monitor (equivalent to a C# lock, I believe):
DrawingThread.h
class DrawingThread {
    bool stopRequested;
    Runtime::Monitor CSMonitor;
    CPInfo *pPInfo;
    //More..
};
DrawingThread.cpp
void DrawingThread::Run() {
    if (!stopRequested) {
        //Time consuming call#1
    }
    if (!stopRequested) {
        CSMonitor.Enter();
        pPInfo = new CPInfo(/**/);
        //Not time consuming, but pPInfo must either be null or constructed.
        CSMonitor.Exit();
    }
    if (!stopRequested) {
        pPInfo->foobar(/**/); //Time consuming and can be signalled
    }
    if (!stopRequested) {
        //One more optional but time consuming call.
    }
}

void DrawingThread::RequestStop() {
    CSMonitor.Enter();
    stopRequested = true;
    if (pPInfo) pPInfo->RequestStop();
    CSMonitor.Exit();
}
I understand that (at least on Windows) Monitor/locks are the least expensive thread synchronization primitive, but I am keen to avoid overuse. Should I be wrapping each read of this boolean flag? It is initialized to false and only set once to true when a stop is requested (if it is requested before the task completes).
My tutors advised protecting even bools because reads/writes may not be atomic. I think this one-shot flag is the exception that proves the rule?
It is never OK to read something possibly modified in a different thread without synchronization. What level of synchronization is needed depends on what you are actually reading. For primitive types, you should have a look at atomic reads, e.g. in the form of std::atomic<bool>.
The reason synchronization is always needed is that the processors may hold the data in a shared cache line. There is no reason for the value to be updated to one possibly changed in a different thread if there is no synchronization. Worse yet, if there is no synchronization, the wrong value may be written back if something stored close to the value is changed and synchronized.
Boolean assignment is atomic. That's not the problem.
The problem is that a thread may not see changes to a variable made by a different thread, due to either compiler or CPU instruction reordering or data caching (i.e. the thread that reads the boolean flag may read a cached value instead of the actual updated value).
The solution is a memory fence, which indeed is implicitly added by lock statements, but for a single variable it's overkill. Just declare it as std::atomic<bool>.
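A minimal sketch of that suggestion applied to the question's flag (the Monitor is still needed around pPInfo, since std::atomic only protects the flag itself):
#include <atomic>

std::atomic<bool> stopRequested{false};

void RequestStop() {
    stopRequested.store(true);   // seq_cst by default; release would also do
}

bool ShouldStop() {
    return stopRequested.load(); // safe from any thread, no lock needed
}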
The answer, I believe, is "it depends." If you're using C++03, threading isn't defined in the Standard, and you'll have to read what your compiler and your thread library say, although this kind of thing is usually called a "benign race" and is usually OK.
If you're using C++11, benign races are undefined behavior. Even when undefined behavior doesn't make sense for the underlying data type. The problem is that compilers can assume that programs have no undefined behavior, and make optimizations based on that (see also the Part 1 and Part 2 linked from there). For instance, your compiler could decide to read the flag once and cache the value because it's undefined behavior to write to the variable in another thread without some kind of mutex or memory barrier.
Of course, it may well be that your compiler promises to not make that optimization. You'll need to look.
The easiest solution is to use std::atomic<bool> in C++11, or something like Hans Boehm's atomic_ops elsewhere.
No, you have to protect every access, since modern compilers and CPUs reorder code without your multithreading tasks in mind. Read access from different threads might work, but it doesn't have to.

What sync primitives can I use with clone(2) (C/C++)?

What C++ synchronization primitives can I use when using Linux's clone(2) threads? I specifically cannot use pthreads because I'm building a shared library that replaces many of pthreads's function calls with different definitions, but I'm in need of a mutex of some sort.
EDIT: I might have spoken too soon. I looked at the pthread docs, and they use a futex(2) to implement these primitives. I'm assuming that is how I would do this, too?
You can use a futex: http://en.wikipedia.org/wiki/Futex
Here are a simple mutex and cond var based on futexes: http://locklessinc.com/articles/mutex_cv_futex/
Depending on your requirements for the synchronization tool, you can also use atomic operations. E.g. GCC's __sync_lock_test_and_set can easily be used for a spinlock; a sketch follows below. Such userspace locks that avoid system calls can in many cases be more efficient than kernel-based solutions.
Edit: This is the case if you have just a few instructions to protect. Then the probability that a thread that has taken the spinlock is interrupted while holding it is very small. If that happens, the waiting time would be some scheduling cycles, something that on modern systems should only be a no-go for real-time systems; for the average waiting time it should be negligible.
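A minimal spinlock sketch using the GCC builtins mentioned above (my code, not from any particular library):
static volatile int lock_word = 0;

void spin_lock(void) {
    // __sync_lock_test_and_set atomically stores 1 and returns the previous
    // value, with acquire semantics; loop until the previous value was 0.
    while (__sync_lock_test_and_set(&lock_word, 1))
        while (lock_word)
            ;                         // read-only spin to reduce bus traffic
}

void spin_unlock(void) {
    __sync_lock_release(&lock_word);  // stores 0 with release semantics
}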
The proper implementation of a futex-based Mutex is described in Ulrich Drepper's paper "Futexes are tricky". The link osgx gave in his comment is outdated. The current version is here:
http://people.redhat.com/drepper/futex.pdf
It includes not only the code but also a very detailed explanation of why it is correct. The code from the paper:
class mutex
{
public:
    mutex () : val (0) { }

    void lock () {
        int c;
        if ((c = cmpxchg (val, 0, 1)) != 0)
            do {
                if (c == 2 || cmpxchg (val, 1, 2) != 0)
                    futex_wait (&val, 2);
            } while ((c = cmpxchg (val, 0, 2)) != 0);
    }

    void unlock () {
        if (atomic_dec (val) != 1) {
            val = 0;
            futex_wake (&val, 1);
        }
    }

private:
    int val;
};
cmpxchg() and atomic_dec() can be implemented via __sync_val_compare_and_swap() and __sync_fetch_and_add() respectively, which are not standard C but are supported by GCC and, AFAIK, by Clang too.
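One possible mapping onto those builtins, with futex_wait/futex_wake as raw Linux syscalls (a sketch; the paper defines its own versions):
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static inline int cmpxchg(int &val, int oldval, int newval) {
    // returns the value of val before the operation, as the paper's code expects
    return __sync_val_compare_and_swap(&val, oldval, newval);
}

static inline int atomic_dec(int &val) {
    return __sync_fetch_and_add(&val, -1); // returns the pre-decrement value
}

static inline void futex_wait(int *addr, int expected) {
    // sleeps only if *addr still equals expected (this is the "tricky" part)
    syscall(SYS_futex, addr, FUTEX_WAIT, expected, nullptr, nullptr, 0);
}

static inline void futex_wake(int *addr, int count) {
    syscall(SYS_futex, addr, FUTEX_WAKE, count, nullptr, nullptr, 0);
}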
Note, that futex() is Linux-specific. FreeBSD has an emulation AFAIK but I don't know if you can access it natively. If you want to target other platforms than Linux, you cannot use futex(). But since you're using clone() which is Linux-specific, you probably don't care.
About your idea of writing a pthreads-replacement in general:
Forget it.
I went down that route myself because I wanted to have some behaviour offered by clone() but not by pthread_create(). After a while I gave up. glibc is built on the assumption that you are using pthreads. Even (or especially) if you don't link with -lpthread, glibc includes pthread-specific behaviour. The thing that made me give up on clone() was errno. In a multithreaded application you have to provide a thread-local errno unless you want to synchronize each and every libc call with a global mutex. You do that by implementing __errno_location(). Unfortunately glibc has its own __errno_location hardcoded in some places and will not use your replacement.
In other words: If you want to use your own threading library, you cannot use glibc, at least not without reading the source code to each and every function you use and being prepared to replace it if necessary. And if you're thinking of using uClibc instead, I'll have to disappoint you. They copied parts of their implementation from glibc and at least older versions have the above __errno_location issue.
My solution was to give up fighting pthreads. I now use pthread_create() to create my threads (with a nice C++ wrapper, of course). While I haven't tried it yet, the unshare(2) system call should allow me to change those aspects of the thread that I would have liked to set via clone() which pthread_create() doesn't support. If that doesn't work, I will take the glibc source for pthread_create() and hack my options into its call to clone(), keeping the rest identical so that I don't break compatibility.
As for synchronization primitives: Using pthread_create() does NOT force you to use pthread_mutex and company. Because I'm writing a VM where every object has a mutex assigned I don't want the space overhead of pthread_mutex_t. Instead I'm using Drepper's Mutex class from the paper above. You can freely mix and match pthread's primitives and your own. Wherever possible you should use the pthread versions because they are well tested. But for special cases (such as the extremely lightweight mutex) it's okay to create your own futex-based primitives.