How does a mutex condition signaling loop work?

I will make a hypothetical scenario just to be clear about what I need to know.
Let's say I have a single file being updated very often.
I need to read and parse this file by several different threads.
Every time this file is rewritten, I'm going to signal a condition mutex so the other threads can do whatever they want to.
My question is:
If I have 10000 threads, will the execution of the first thread block the execution of the other 9999?
Does it work in parallel or synchronously?

This post has been edited since first posted to address comments below by Jonathan Wakely, and to better distinguish between a condition_variable, a condition (which were both called condition in the first version), and how the wait function operates. Just as important, however, is an exploration of better methods from modern C++, using std::future, std::thread and std::packaged_task, with some discussion regarding buffering and reasonable thread count.
First, 10,000 threads is a lot of threads. The thread scheduler will be highly burdened on all but the very highest performance of computers. Typical quad core workstations under Windows would struggle. It's a sign that some kind of queued scheduling of tasks is in order, typical of servers accepting thousands of connections using perhaps 10 threads, each servicing 1,000 connects. The number of threads is really not important to the question, but that in such a volume of tasks 10,000 threads is impracticable.
To handle synchronization, the mutex doesn't actually do what you're proposing, by itself. The concept you're describing is a type of event object, perhaps an auto reset event, which by itself is a higher level concept. Windows has them as part of its API, but they are fashioned on Linux (and for portable software, usually) with two primitive components, a mutex and a condition variable. Together these create the auto reset event, and other types of "waitable events" as Windows calls them. In C++ these are provided by std::mutex and std::condition_variable.
Mutexes by themselves merely provide locked control over a common resource. In that scenario we are not thinking in terms of clients and a server (or workers and an executive), but we're thinking in terms of competition among peers for a single resource which can only be accessed by one actor (thread) at a time. A mutex can block execution, but it does not release based on an external signal. Mutexes block if another thread has locked the mutex, and wait indefinitely until the owner of the lock releases it. This isn't the scenario you present in the question.
In your scenario, there are many "clients" and one "server" thread. The server is in charge of signalling that something is ready to be processed. All other threads are clients in this design (nothing about the thread itself makes them clients, we merely deem them so by the function they execute). In some discussions, clients are called worker threads.
The clients use a mutex/condition variable pair to wait for a signal. This construct usually takes the form of locking a mutex, then waiting on the condition variable using that mutex. When a thread enters wait on the condition variable, the mutex is unlocked. This is repeated for all client threads who wait for work to be done. A typical client wait example is:
std::mutex m;
std::condition_variable cv;
void client_thread()
{
    // Wait until server signals data is ready
    std::unique_lock<std::mutex> lk(m); // lock the mutex
    cv.wait(lk);                        // wait on cv
    // do the work
}
This is pseudo code showing the mutex/conditional variable used together. std::condition_variable has two overloads of the wait function, this is the simplest one. The intent is that a thread will block, entering into an idle state until the condition_variable is signalled. It is not intended as a complete example, merely to point out these two objects are used together.
Jonathan Wakely's comments below are based on the fact that wait is not indefinite; there is no guarantee that the call unblocked because of a signal. The documentation calls this a "spurious wakeup", which occasionally occurs for complex reasons of OS scheduling. The point Jonathan makes is that code using this pair must be safe to operate even if the wakeup did not come from a signal on the condition_variable.
In the parlance of using condition variables, this is known as a condition (not the condition_variable). The condition is an application defined concept, usually illustrated as a boolean in the literature, and often the result of checking a bool, an integer (sometimes of atomic type) or calling a function returning a bool. Sometimes application defined notions of what constitutes a true condition are more complex, but the overall effect of the condition is to determine whether or not the thread, once awakened, should continue to process, or should simply repeat the wait.
One way to satisfy this requirement is the second version of std::condition_variable::wait. The two are declared:
void wait( std::unique_lock<std::mutex>& lock );
template< class Predicate >
void wait( std::unique_lock<std::mutex>& lock, Predicate pred );
Jonathan's point is to insist the second version be used. However, documentation describes (and the fact there are two overloads indicates) that the Predicate is optional. The Predicate is a functor of some kind, often a lambda expression, resolving to true if the wait should unblock, false if the wait should continue waiting, and it is evaluated under lock. The Predicate is synonymous with condition in that the Predicate is one way in which to indicate true or false regarding whether wait should unblock.
Although the Predicate is, in fact, optional, the notion that 'wait' is not perfect in blocking until a signal is received requires that if the first version is used, it is because the application is constructed such that spurious wakes have no consequence (indeed, are part of the design).
Jonathan's citation shows that the Predicate is evaluated under lock, but in generalized forms of the paradigm that's frequently not practicable. std::condition_variable must wait on a locked std::mutex, which may be protecting a variable defining the condition, but sometimes that's not possible. Sometimes the condition is more complex, external, or trivial enough that the std::mutex isn't associated with the condition.
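As a minimal sketch of the predicate overload in use (the names Gate, ready, woken, wait_until_ready, open and run_gate_demo are made up for this illustration, not taken from the discussion above), the condition here is a plain bool protected by the same mutex the condition_variable waits on:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared state: `ready` is the condition, guarded by `m`.
struct Gate {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;   // the application-defined condition
    int woken = 0;        // how many clients got past the wait

    // Client side: blocks until `ready` is true, ignoring spurious wakeups.
    void wait_until_ready() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return ready; }); // predicate runs with m locked
        ++woken;                               // still holding m here
    }

    // Server side: change the condition under the lock, then notify.
    void open() {
        {
            std::lock_guard<std::mutex> lk(m);
            ready = true;
        }
        cv.notify_all();
    }
};

// Starts `n` client threads, opens the gate, and returns how many got through.
int run_gate_demo(int n) {
    Gate g;
    std::vector<std::thread> clients;
    for (int i = 0; i < n; ++i)
        clients.emplace_back([&g] { g.wait_until_ready(); });
    g.open();
    for (auto& t : clients) t.join();
    return g.woken;
}
```

Because the predicate is checked before blocking, a client that arrives after open() has already run does not wait at all, so the ordering of clients and server does not matter in this sketch.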
To see how that works in the context of the proposed solution, assume there are 10 client threads waiting for a server to signal that work is to be done, and that work is scheduled in a queue as a container of virtual functors. A virtual functor might be something like:
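struct VFunc
{
    virtual ~VFunc() = default;
    virtual void operator()() {}
};
template <typename T>
struct VFunctor : VFunc
{
    // Something referring to T, possibly a std::function
    std::function<void()> fn;
    void operator()() override { fn(); } // call the std::function
};
// Store pointers: a deque of VFunc by value would slice derived objects
typedef std::deque< std::unique_ptr<VFunc> > Queue;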
The pseudo code above suggests a typical functor with a virtual operator(), returning void and taking no parameters, sometimes known as a "blind call". The key point in suggesting it is the fact Queue can own a collection of these without knowing what is being called, and whatever VFunctors are in Queue could refer to anything std::function might be able to call, which includes member functions of other objects, lambdas, simple functions, etc. If, however, there is only one function signature to be called, perhaps:
typedef std::deque< std::function<void(void)> > Queue;
is sufficient.
For either case, work is to be done only if there are entries in Queue.
To wait, one might use a class like:
class AutoResetEvent
{
private:
    std::mutex m;
    std::condition_variable cv;
    bool signalled;
    bool signalled_all;
    unsigned int wcount;
public:
    AutoResetEvent() : signalled(false), signalled_all(false), wcount(0) {}
    void SignalAll()
    {
        std::unique_lock<std::mutex> l(m);
        signalled = true;
        signalled_all = true;
        cv.notify_all();
    }
    void SignalOne()
    {
        std::unique_lock<std::mutex> l(m);
        signalled = true;
        cv.notify_one();
    }
    void Wait()
    {
        std::unique_lock<std::mutex> l(m);
        ++wcount;
        while (!signalled)
        {
            cv.wait(l);
        }
        --wcount;
        if (signalled_all)
        {
            if (wcount == 0)
            {
                signalled = false;
                signalled_all = false;
            }
        }
        else
        {
            signalled = false;
        }
    }
};
This is pseudo code of a standard reset event type of waitable object, compatible with the Windows CreateEvent and WaitForSingleObject API and functioning in basically the same way.
All client threads end up at cv.wait (the Windows API lets this wait take a timeout; with std::condition_variable a timeout requires wait_for or wait_until instead of wait). At some point, the server signals the event with a call to Signalxxx. Your scenario suggests SignalAll().
If notify_one is called, one of the waiting threads is released, and all others remain asleep. If notify_all is called, then all threads waiting on that condition variable are released to do work.
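To tie this back to the original question (does the first thread block the others?), here is a rough sketch under some assumptions: the function name all_workers_overlap and the variables updated and working are invented for the illustration, and the "work" is only a stand-in for parsing. After notify_all each waiter has to reacquire the mutex in turn to leave wait, but once it releases the mutex the work itself can overlap with the other threads:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Returns true once all `n` workers were inside their work section at the same time.
bool all_workers_overlap(int n) {
    std::mutex m;
    std::condition_variable cv;
    bool updated = false;          // condition: "the file was rewritten"
    std::atomic<int> working{0};   // workers that have entered the work section

    auto worker = [&] {
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return updated; });
        }   // mutex released here, before the work starts
        ++working;
        // Stand-in for parsing: stay "busy" until every worker has arrived.
        // This loop could never finish if the workers ran strictly one after another.
        while (working.load() < n) std::this_thread::yield();
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < n; ++i) pool.emplace_back(worker);

    {
        std::lock_guard<std::mutex> lk(m);
        updated = true;
    }
    cv.notify_all();               // wake every waiter

    for (auto& t : pool) t.join();
    return working.load() == n;
}
```

If a worker kept the mutex locked while doing its work, the others really would be serialized behind it, so the scope of the lock is what decides between "parallel" and "one at a time".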
The following might be an example of using AutoResetEvent:
AutoResetEvent evt; // probably not a global
void client()
{
    while( !Shutdown ) // assuming some bool to indicate shutdown
    {
        if ( IsWorkPending() ) DoWork();
        evt.Wait();
    }
}
void server()
{
    // gather data
    evt.SignalAll();
}
The use of IsWorkPending() satisfies the notion of a condition, as Jonathan Wakely indicates. Until a shutdown is indicated, this loop will process work if it's pending, and wait for a signal otherwise. Spurious wakeups have no negative effect. IsWorkPending() would check Queue.size(), possibly through an object which protects Queue with a std::mutex or some other synchronization mechanism. If work is pending, DoWork() would sequentially pop entries out of Queue until Queue is empty. Upon return, the loop would again wait for a signal.
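One possible way to fold IsWorkPending(), DoWork() and the shutdown flag into a single object is sketched below. The class name WorkQueue and its members Post, Shutdown and Run are assumptions made for this example; here the queue, the shutdown flag and the condition all live under one mutex, which is only one of several reasonable layouts:

```cpp
#include <atomic>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

class WorkQueue {
public:
    // Server side: add a task and wake one client.
    void Post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lk(m_);
            tasks_.push_back(std::move(task));
        }
        cv_.notify_one();
    }

    // Ask clients to exit once the queue has drained.
    void Shutdown() {
        {
            std::lock_guard<std::mutex> lk(m_);
            shutdown_ = true;
        }
        cv_.notify_all();
    }

    // Client loop: the condition is "work is pending or shutdown was requested".
    void Run() {
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            cv_.wait(lk, [this] { return shutdown_ || !tasks_.empty(); });
            if (tasks_.empty()) return;      // only reachable after Shutdown()
            std::function<void()> task = std::move(tasks_.front());
            tasks_.pop_front();
            lk.unlock();                     // run the task without holding the lock
            task();
            lk.lock();
        }
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> tasks_;
    bool shutdown_ = false;
};

// Runs `count` trivial tasks on two client threads and returns how many ran.
int run_work_queue_demo(int count) {
    WorkQueue q;
    std::atomic<int> done{0};
    std::thread a([&q] { q.Run(); });
    std::thread b([&q] { q.Run(); });
    for (int i = 0; i < count; ++i)
        q.Post([&done] { ++done; });
    q.Shutdown();
    a.join();
    b.join();
    return done.load();
}
```

Tasks posted before Shutdown() are still executed, because a client only returns when it finds the queue empty.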
With all of that discussed, the combination of mutex and condition_variable is related to an old style of thinking, now outdated in the era of C++11/C++14. Unless you have trouble using a compliant compiler, it would be better to investigate the use of std::promise, std::future and either std::async or std::thread with std::packaged_task. For example, using future, promise, packaged_task and thread could entirely replace the discussion above.
For example:
// a function for threads to execute
int func()
{
    int result = 0;
    // do some work, store status in result
    return result;
}
Assuming func does the work you require on the files, these typedefs apply:
typedef std::packaged_task< int() > func_task;
typedef std::future< int > f_int;
typedef std::shared_ptr< f_int > f_int_ptr;
typedef std::vector< f_int_ptr > f_int_vec;
std::future can't be copied, so it's stored using a shared_ptr for ease of use in a vector, but there are various solutions.
Next, an example of using these for 10 threads of work
void executive_function()
{
    // a vector of future pointers
    f_int_vec future_list;
    // start some threads
    for( int n=0; n < 10; ++n )
    {
        // a packaged_task calling func
        func_task ft( &func );
        // get a future from the task as a shared_ptr
        f_int_ptr future_ptr( new f_int( ft.get_future() ) );
        // store the future for later use
        future_list.push_back( future_ptr );
        // launch a thread to call task
        std::thread( std::move( ft )).detach();
    }
    // at this point, 10 threads are running
    for( auto &d : future_list )
    {
        // for each future pointer, wait (block if required)
        // for each thread's func to return
        d->wait();
        // get the result of the func return value
        int res = d->get();
    }
}
The point here is really in the last range-for loop. The vector stores futures, which the packaged_tasks provided. Those tasks are used to launch threads, and the future is key to synchronizing the executive. Once all threads are running, each is "waited on" with a simple call to the future's wait function, after which the return value of func can be obtained. No mutexes or condition_variables involved (that we know of).
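For comparison, a shorter variant using std::async (mentioned earlier as an alternative) might look like the sketch below. The function names work and run_async_tasks are hypothetical, and work is just a stand-in that squares its argument. Since std::future is movable, the futures are stored in the vector directly here rather than through shared_ptr:

```cpp
#include <future>
#include <vector>

// Hypothetical stand-in for the real per-task work.
int work(int n) { return n * n; }

// Launches `count` tasks and returns the sum of their results.
int run_async_tasks(int count) {
    std::vector<std::future<int>> futures;   // futures are movable, so no shared_ptr is needed
    for (int n = 0; n < count; ++n)
        futures.push_back(std::async(std::launch::async, work, n));

    int sum = 0;
    for (auto& f : futures)
        sum += f.get();                      // get() blocks until the result is ready
    return sum;
}
```

With std::launch::async each call runs as if on its own thread, so the thread-count concerns discussed above apply here as well.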
This brings me to the subject of processing files in parallel, no matter how you launch a number of threads. If there were a machine which could handle 10,000 threads, then if each thread were a trivial file oriented operation there would be considerable RAM resources devoted to file processing, all duplicating each other. Depending on the API chosen, there are buffers associated with each read operation.
Let's say the file was 10 Mbytes, and 10,000 threads began operating on it, where each thread used 4 Kbyte buffers for processing. Combined, that suggests there would be 40 Mbytes of buffers to process a 10 Mbyte file. It would be less wasteful to simply read the file into RAM, and offer read only access to all threads from RAM.
That notion is further complicated by the fact that multiple tasks reading from various sections of the file at different times may cause heavy thrashing from a standard hard disk (not so for flash sources), if the disk cache can't keep up. More importantly, though, is that 10,000 threads are all calling system API's for reading the file, each with considerable overhead.
If the source material is a candidate for reading entirely into RAM, the threads could be focused on RAM instead of the file, alleviating that overhead, improving performance. The threads could share read access to the contents without locks.
If the source file is too large to read entirely into RAM, it may still be best read in blocks of the source file, have threads process that portion from a shared memory resource, then move to the next block in a series.
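A rough sketch of the read-once, share-read-only idea follows. The helper names read_file and count_lines_parallel are invented for this example, and counting newlines is only a placeholder for whatever parsing is actually needed. The buffer is never modified after it is loaded, so the threads need no lock to read it, and each thread writes only to its own slot of the result vector:

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <thread>
#include <vector>

// Reads the whole file into memory once (returns an empty string on failure).
std::string read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Counts '\n' characters using `nthreads` threads that only read `data`.
std::size_t count_lines_parallel(const std::string& data, unsigned nthreads) {
    if (nthreads == 0) nthreads = 1;
    std::vector<std::size_t> partial(nthreads, 0);   // one slot per thread
    std::vector<std::thread> pool;
    const std::size_t chunk = (data.size() + nthreads - 1) / nthreads;

    for (unsigned i = 0; i < nthreads; ++i) {
        const std::size_t begin = std::min(data.size(), i * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        pool.emplace_back([&data, &partial, i, begin, end] {
            partial[i] = static_cast<std::size_t>(
                std::count(data.begin() + begin, data.begin() + end, '\n'));
        });
    }
    for (auto& t : pool) t.join();

    std::size_t total = 0;
    for (std::size_t p : partial) total += p;
    return total;
}
```

Typical use would be something like count_lines_parallel(read_file(path), 4), with the thread count kept near the number of cores rather than in the thousands.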

Related

Is it advisable to use locks in multithreaded real-time code when they are contained within a class?

In the situation where you have a high priority real-time data processing thread reading from a parameter and a second thread with lower priority writing to the parameter (or vice versa) you can end up with a data race.
One solution would be to let the parameter be an std::atomic or to protect the reading/writing of the parameter with mutexes. According to this talk it's best to avoid non-lock free atomics and mutexes to solve such a data race for the following reasons:
The real-time thread will occasionally have to wait to acquire the lock. The execution time of the real-time code, which has to meet a deadline, will suddenly have to take into account any code that can be executed inside the lock protected section of the low priority thread.
If the code inside the lock protected section in the low priority thread contains e.g. memory allocations/deallocations, system calls, or other calls with unbounded execution time, the real-time thread can no longer guarantee that it's able to meet its deadline.
The low priority thread can be interrupted by threads which have a higher priority but a priority still below that of the real-time thread, leading to priority inversion.
If one can disregard the last point due to an OS with priority inheritance, is it still advisable to avoid locking even if all lock protected sections are contained inside the same class such as in this example:
#include <mutex>
class RealTimeProcessor
{
public:
    struct Parameter
    {
        // some non atomic data
    };
    void processData()
    {
        Parameter p;
        mtx.lock();
        p = m_param;
        mtx.unlock();
        // do some processing with the local copy of the parameter...
    }
    void setParameter(Parameter param)
    {
        mtx.lock();
        m_param = param;
        mtx.unlock();
    }
private:
    Parameter m_param;
    std::mutex mtx;
};
Where processData() is called from the real-time thread and setParameter() is called from the low priority thread.
My thought is that since all locking is done within the class, I can make sure that no calls with unbounded execution time are made within the lock protected areas, and that the total worst case completion time of the real-time processing still meets the deadline even in the worst case when having to wait to acquire the lock.

Multithreading implementation in threads

I am in process of implementing messages passing from one thread to another
Thread 1: Callback functions are registered with libraries; on callback, the functions are invoked and the work needs to be sent to another thread for processing, as it takes time.
Thread 2: Checks whether any messages are available (preferably in a queue) and processes them.
Is using a condition_variable with a mutex a correct approach to start with, considering that thread 2's processing takes time, during which multiple other messages can be added by thread 1?

Is using a condition_variable with a mutex a correct approach to start with, considering that thread 2's processing takes time, during which multiple other messages can be added by thread 1?
The question is a bit vague about how a condition variable and mutex would be used, but yes, there would definitely be a role for such objects. The high-level view would be something like this:
The mutex would protect access to the message queue. Any read or modification of the queue, by any thread, would be done only while holding the mutex locked.
The message-processing thread would block on the CV in the event that it became ready to process a new message but the queue was empty.
The message-generating thread would signal the CV each time it enqueued a new message.
This is exactly a producer / consumer problem, and you can find a lot of information about such problems using that terminology.
But note also that there are multiple message queue implementations already available to serve exactly your purpose ("message queue" is in fact a standard term for these), so you should consider whether you really want to reinvent this wheel.
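Since the concern in the question is that several messages can pile up while thread 2 is busy, one variation worth sketching is to let the consumer take everything that has accumulated in one go. Mailbox, Message, Send and ReceiveAll are names made up for this sketch, and std::string is only a placeholder message type:

```cpp
#include <condition_variable>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Message is a placeholder type for this sketch.
using Message = std::string;

class Mailbox {
public:
    // Thread 1 (callbacks): cheap enqueue, then signal.
    void Send(Message msg) {
        {
            std::lock_guard<std::mutex> lk(m_);
            pending_.push_back(std::move(msg));
        }
        cv_.notify_one();
    }

    // Thread 2: blocks until at least one message exists, then takes them all.
    std::vector<Message> ReceiveAll() {
        std::vector<Message> batch;
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !pending_.empty(); });
        batch.swap(pending_);   // constant time; slow processing happens outside the lock
        return batch;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<Message> pending_;
};
```

The mutex is held only for the push and the swap, so callbacks on thread 1 are not delayed by however long thread 2 spends processing a batch.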

In general, mutexes are intended to control access between threads; but not great for notifying between threads.
If you design Thread2 to wait on the condition; you can simply process messages as they are received from Thread1.
Here would be a rough implementation
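void pushFunction(const Message& data) // Message: placeholder for the queue's element type
{
    // Obtain the mutex (std::unique_lock, because it is unlocked before notifying;
    // std::lock_guard has no unlock())
    std::unique_lock<std::mutex> lock(myMutex);
    const bool empty = myQueue.empty();
    myQueue.push(data);
    lock.unlock();
    if (empty)
    {
        conditionVar.notify_one();
    }
}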
In Thread 2
void waitForMessage()
{
    // condition_variable::wait requires a std::unique_lock, not a lock_guard
    std::unique_lock<std::mutex> lock(myMutex);
    while (myQueue.empty())
    {
        conditionVar.wait(lock);
    }
    rxMessage = myQueue.front();
    myQueue.pop();
}
It's important to note that the condition can spuriously wake up so it's important to keep it in the 'while empty' loop.
See https://en.cppreference.com/w/cpp/thread/condition_variable

How to avoid race conditions in a condition variable in VxWorks

We're programming on a proprietary embedded platform sitting atop of VxWorks 5.5. In our toolbox, we have a condition variable, that is implemented using a VxWorks binary semaphore.
Now, POSIX provides a wait function that also takes a mutex. This will unlock the mutex (so that some other task might write to the data) and waits for the other task to signal (it is done writing the data). I believe this implements what's called a Monitor, ICBWT.
We need such a wait function, but implementing it is tricky. A simple approach would do this:
bool condition::wait_for(mutex& mutex) const {
    unlocker ul(mutex); // relinquish mutex
    return wait(event);
} // ul's dtor grabs mutex again
However, this sports a race condition because it allows another task to preempt this one after the unlocking and before the waiting. The other task can write to the data after it was unlocked and signal the condition before this task starts to wait for the semaphore. (We have tested this and this indeed happens and blocks the waiting task forever.)
Given that VxWorks 5.5 doesn't seem to provide an API to temporarily relinquish a semaphore while waiting for a signal, is there a way to implement this on top of the provided synchronization routines?
Note: This is a very old VxWorks version that has been compiled without POSIX support (by the vendor of the proprietary hardware, from what I understood).

This should be quite easy with native vxworks, a message queue is what is required here. Your wait_for method can be used as is.
bool condition::wait_for(mutex& mutex) const
{
unlocker ul(mutex); // relinquish mutex
return wait(event);
} // ul's dtor grabs mutex again
but the wait(event) code would look like this:
wait(event)
{
    // msgQReceive returns the number of bytes received, or ERROR on timeout/failure
    if (msgQReceive(event->q, sigMsgBuf, sigMsgSize, timeoutTime) != ERROR)
    {
        // got it...
    }
    else
    {
        // timeout, report error or something like that....
    }
}
and your signal code would look something like this:
signal(event)
{
    msgQSend(event->q, sigMsg, sigMsgSize, NO_WAIT, MSG_PRI_NORMAL);
}
So if the signal gets triggered before you start waiting, then msgQReceive will return immediately with the signal when it eventually gets invoked and you can then take the mutex again in the ul dtor as stated above.
The event->q is a MSG_Q_ID that is created at event creation time with a call to msgQCreate, and the data in sigMsg is defined by you... but can be just a random byte of data, or you can come up with a more intelligent structure with information regarding who signaled or something else that may be nice to know.
Update for multiple waiters, this is a little tricky: So there are a couple of assumptions I will make to simplify things
The number of tasks that will be pending is known at event creation time and is constant.
There will be one task that is always responsible for indicating when it is ok to unlock the mutex, all other tasks just want notification when the event is signaled/complete.
This approach uses a counting semaphore, similar to the above with just a little extra logic:
wait(event)
{
if (semTake(event->csm, timeoutTime) == OK)
{
// got it...
}
else
{
// timeout, report error or something like that....
}
}
and your signal code would look something like this:
signal(event)
{
    for (int x = 0; x < event->numberOfWaiters; x++)
    {
        semGive(event->csm);
    }
}
The creation of the event is something like this, remember in this example the number of waiters is constant and known at event creation time. You could make it dynamic, but the key is that every time the event is going to happen the numberOfWaiters must be correct before the unlocker unlocks the mutex.
createEvent(numberOfWaiters)
{
    event->numberOfWaiters = numberOfWaiters;
    event->csm = semCCreate(SEM_Q_FIFO, 0); // same counting semaphore used by wait() and signal()
    return event;
}
You cannot be wishy-washy about the numberOfWaiters :D I will say it again: The numberOfWaiters must be correct before the unlocker unlocks the mutex. To make it dynamic (if that is a requirement) you could add a setNumWaiters(numOfWaiters) function, and call that in the wait_for function before the unlocker unlocks the mutex, so long as it always sets the number correctly.
Now for the last trick, as stated above the assumption is that one task is responsible for unlocking the mutex, the rest just wait for the signal, which means that one and only one task will call the wait_for() function above, and the rest of the tasks just call the wait(event) function.
With this in mind the numberOfWaiters is computed as follows:
The number of tasks who will call wait()
plus 1 for the task that calls wait_for()
Of course you can also make this more complex if you really need to, but chances are this will work because normally 1 task triggers an event, but many tasks want to know it is complete, and that is what this provides.
But your basic flow is as follows:
init()
{
    event->createEvent(3);
}
eventHandler()
{
    locker l(mutex);
    doEventProcessing();
    signal(event);
}
taskA()
{
    doOperationThatTriggersAnEvent();
    wait_for(mutex);
    eventComplete();
}
taskB()
{
    doWhateverIWant();
    // now I need to know if the event has occurred...
    wait(event);
    coolNowIKnowThatIsDone();
}
taskC()
{
    taskCIsFun();
    wait(event);
    printf("event done!\n");
}
When I write the above I feel like all OO concepts are dead, but hopefully you get the idea, in reality wait and wait_for should take the same parameter, or no parameter but rather be members of the same class that also has all the data they need to know... but none the less that is the overview of how it works.

Race conditions can be avoided if each waiting task waits on a separate binary semaphore.
These semaphores must be registered in a container which the signaling task uses to unblock all waiting tasks. The container must be protected by a mutex.
The wait_for() method obtains a binary semaphore, waits on it and finally deletes it.
void condition::wait_for(mutex& mutex) {
    SEM_ID sem = semBCreate(SEM_Q_PRIORITY, SEM_EMPTY);
    {
        lock l(listeners_mutex); // assure exclusive access to listeners container
        listeners.push_back(sem);
    } // l's dtor unlocks listeners_mutex again
    unlocker ul(mutex); // relinquish mutex
    semTake(sem, WAIT_FOREVER);
    {
        lock l(listeners_mutex);
        // remove sem from listeners
        // ...
        semDelete(sem);
    }
} // ul's dtor grabs mutex again
The signal() method iterates over all registered semaphores and gives each one, unblocking its waiter.
void condition::signal() {
    lock l(listeners_mutex);
    for_each (listeners.begin(), listeners.end(), /* call semGive()... */ );
}
This approach assures that wait_for() will never miss a signal. A disadvantage is the need of additional system resources.
To avoid creating and destroying semaphores for every wait_for() call, a pool could be used.

From the description, it looks like you may want to implement (or use) a semaphore - it's a standard CS algorithm with semantics similar to condvars, and there are tons of textbooks on how to implement them (https://www.google.com/search?q=semaphore+algorithm).
A random Google result which explains semaphores is at: http://www.cs.cornell.edu/courses/cs414/2007sp/lectures/08-bakery.ppt (see slide 32).

Lightest synchronization primitive for worker thread queue

I am about to implement a worker thread with work item queuing, and while I was thinking about the problem, I wanted to know if I'm doing the best thing.
The thread in question will have to have some thread local data (preinitialized at construction) and will loop on work items until some condition will be met.
pseudocode:
volatile bool run = true;
int WorkerThread(param)
{
    localclassinstance c1 = new c1();
    [other initialization]
    while(true) {
        [LOCK]
        [unqueue work item]
        [UNLOCK]
        if([hasWorkItem]) {
            [process data]
            [PostMessage with pointer to data]
        }
        [Sleep]
        if(!run)
            break;
    }
    [uninitialize]
    return 0;
}
I guess I will do the locking via critical section, as the queue will be std::vector or std::queue, but maybe there is a better way.
The part with Sleep doesn't look too great, as there will be a lot of extra sleeping with big Sleep values, or lots of extra locking when the Sleep value is small, and that's definitely unnecessary.
But I can't think of a WaitForSingleObject friendly primitive I could use instead of a critical section, as there might be two threads queuing work items at the same time. So an Event, which seems to be the best candidate, can lose the second work item if the Event was already set, and it doesn't guarantee mutual exclusion.
Maybe there is even a better approach with InterlockedExchange kind of functions that leads to even less serialization.
P.S.: I might need to preprocess the whole queue and drop the obsolete work items during the unqueuing stage.

There are a multitude of ways to do this.
One option is to use a semaphore for the waiting. The semaphore is signalled every time a value is pushed on the queue, so the worker thread will only block if there are no items in the queue. This will still require separate synchronization on the queue itself.
A second option is to use a manual-reset event which is set when there are items in the queue and cleared when the queue is empty. Again, you will need to do separate synchronization on the queue.
A third option is to have an invisible message-only window created on the thread, and use a special WM_USER or WM_APP message to post items to the queue, attaching the item to the message via a pointer.
Another option is to use condition variables. The native Windows condition variables only work if you're targeting Windows Vista or Windows 7, but condition variables are also available for Windows XP with Boost or an implementation of the C++0x thread library. An example queue using boost condition variables is available on my blog: http://www.justsoftwaresolutions.co.uk/threading/implementing-a-thread-safe-queue-using-condition-variables.html
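To illustrate the first option (a semaphore for waiting plus separate synchronization on the queue), here is a portable sketch. It is not Win32 code: the Semaphore class below is a hand-rolled stand-in built from std::mutex and std::condition_variable, and the names Semaphore, IntQueue, Release, Acquire, Push and Pop are assumptions for this example. On Windows the same shape would use a native semaphore object and a critical section instead.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Minimal counting semaphore built from a mutex and a condition variable.
class Semaphore {
public:
    void Release() {
        {
            std::lock_guard<std::mutex> lk(m_);
            ++count_;
        }
        cv_.notify_one();
    }
    void Acquire() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return count_ > 0; });
        --count_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    unsigned count_ = 0;
};

// Queue whose Pop() blocks on the semaphore; the queue itself has its own lock.
class IntQueue {
public:
    void Push(int v) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(v);
        }
        items_.Release();       // one count per queued item, so none is lost
    }
    int Pop() {
        items_.Acquire();       // blocks only while the queue is empty
        std::lock_guard<std::mutex> lk(m_);
        int v = q_.front();
        q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::queue<int> q_;
    Semaphore items_;
};
```

Unlike an auto-reset event, the semaphore counts: two producers pushing at nearly the same time leave a count of two, so the worker wakes twice and the second item is not missed.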

It is possible to share a resource between threads without using blocking locks at all, if your scenario meets certain requirements.
You need an atomic pointer exchange primitive, such as Win32's InterlockedExchange. Most processor architectures provide some sort of atomic swap, and it's usually much less expensive than acquiring a formal lock.
You can store your queue of work items in a pointer variable that is accessible to all the threads that will be interested in it. (global var, or field of an object that all the threads have access to)
This scenario assumes that the threads involved always have something to do, and only occasionally "glance" at the shared resource. If you want a design where threads block waiting for input, use a traditional blocking event object.
Before anything begins, create your queue or work item list object and assign it to the shared pointer variable.
Now, when producers want to push something onto the queue, they "acquire" exclusive access to the queue object by swapping a null into the shared pointer variable using InterlockedExchange. If the result of the swap returns a null, then somebody else is currently modifying the queue object. Sleep(0) to release the rest of your thread's time slice, then loop to retry the swap until it returns non-null. Even if you end up looping a few times, this is many, many times faster than making a kernel call to acquire a mutex object. Kernel calls require hundreds of clock cycles to transition into kernel mode.
When you successfully obtain the pointer, make your modifications to the queue, then swap the queue pointer back into the shared pointer.
When consuming items from the queue, you do the same thing: swap a null into the shared pointer and loop until you get a non-null result, operate on the object in the local var, then swap it back into the shared pointer var.
This technique is a combination of atomic swap and brief spin loops. It works well in scenarios where the threads involved are not blocked and collisions are rare. Most of the time the swap will give you exclusive access to the shared object on the first try, and as long as the length of time the queue object is held exclusively by any thread is very short then no thread should have to loop more than a few times before the queue object becomes available again.
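The steps above can be sketched portably with std::atomic, used here as a stand-in for the Win32 InterlockedExchange family; std::this_thread::yield() plays the role of Sleep(0). The names SwapGuardedQueue, Take, Push, TryPop and run_swap_demo are invented for this illustration:

```cpp
#include <atomic>
#include <deque>
#include <thread>
#include <vector>

using Queue = std::deque<int>;

class SwapGuardedQueue {
public:
    SwapGuardedQueue() : slot_(new Queue) {}
    ~SwapGuardedQueue() { delete slot_.load(); }
    SwapGuardedQueue(const SwapGuardedQueue&) = delete;
    SwapGuardedQueue& operator=(const SwapGuardedQueue&) = delete;

    void Push(int v) {
        Queue* q = Take();
        q->push_back(v);
        slot_.store(q);          // hand the queue back
    }

    // Returns false if the queue was empty.
    bool TryPop(int& out) {
        Queue* q = Take();
        const bool ok = !q->empty();
        if (ok) { out = q->front(); q->pop_front(); }
        slot_.store(q);
        return ok;
    }

private:
    // Swap null into the slot; a null result means another thread holds the queue.
    Queue* Take() {
        for (;;) {
            if (Queue* q = slot_.exchange(nullptr)) return q;
            std::this_thread::yield();   // give up the time slice, then retry
        }
    }
    std::atomic<Queue*> slot_;
};

// Four producers push `per_thread` items each; returns the number of items popped.
int run_swap_demo(int per_thread) {
    SwapGuardedQueue q;
    std::vector<std::thread> producers;
    for (int t = 0; t < 4; ++t)
        producers.emplace_back([&q, per_thread] {
            for (int i = 0; i < per_thread; ++i) q.Push(i);
        });
    for (auto& t : producers) t.join();
    int popped = 0, v = 0;
    while (q.TryPop(v)) ++popped;
    return popped;
}
```

This is effectively a small spin lock around the pointer, so it only pays off when the queue is held very briefly, as the answer notes.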
If you expect a lot of contention between threads in your scenario, or you want a design where threads spend most of their time blocked waiting for work to arrive, you may be better served by a formal mutex synchronization object.

The fastest locking primitive is usually a spin-lock or spin-sleep-lock. CRITICAL_SECTION is just such a (user-space) spin-sleep-lock.
(Well, aside from not using locking primitives at all of course. But that means using lock-free data-structures, and those are really really hard to get right.)
As for avoiding the Sleep: have a look at condition-variables. They're designed to be used together with a "mutex", and I think they're much easier to use correctly than Windows' EVENTs.
Boost.Thread has a nice portable implementation of both, fast user-space spin-sleep-locks and condition variables:
http://www.boost.org/doc/libs/1_44_0/doc/html/thread/synchronization.html#thread.synchronization.condvar_ref
A work-queue using Boost.Thread could look something like this:
template <class T>
class Queue : private boost::noncopyable
{
public:
    // m_maxSize is const, so it has to be set in a constructor
    explicit Queue(size_t maxSize) : m_maxSize(maxSize) {}

    void Enqueue(T const& t)
    {
        unique_lock lock(m_mutex);
        // wait until the queue is not full
        while (m_backingStore.size() >= m_maxSize)
            m_queueNotFullCondition.wait(lock); // releases the lock temporarily
        m_backingStore.push_back(t);
        m_queueNotEmptyCondition.notify_all(); // notify waiters that the queue is not empty
    }
    T DequeueOrBlock()
    {
        unique_lock lock(m_mutex);
        // wait until the queue is not empty
        while (m_backingStore.empty())
            m_queueNotEmptyCondition.wait(lock); // releases the lock temporarily
        T t = m_backingStore.front();
        m_backingStore.pop_front();
        m_queueNotFullCondition.notify_all(); // notify waiters that the queue is not full
        return t;
    }
private:
    typedef boost::recursive_mutex mutex;
    typedef boost::unique_lock<boost::recursive_mutex> unique_lock;
    size_t const m_maxSize;
    mutex mutable m_mutex;
    boost::condition_variable_any m_queueNotEmptyCondition;
    boost::condition_variable_any m_queueNotFullCondition;
    std::deque<T> m_backingStore;
};

There are various ways to do this.
For one, you could create an event called 'run' and use it to detect when the thread should terminate; the main thread signals it. Instead of Sleep you would then use WaitForSingleObject with a timeout, so the thread quits immediately instead of waiting out the full sleep interval.
Another way is to accept messages in your loop and define a user-defined message that you post to the thread.
EDIT: Depending on the situation, it may also be wise to have yet another thread that monitors this one to check whether it is dead. This can be done with the message queue mentioned above: replying to a certain message within x ms would mean that the thread hasn't locked up.
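As a rough portable illustration of the first suggestion, the sketch below swaps the Win32 event and WaitForSingleObject for a std::condition_variable with a timed wait. It is not the Win32 API; StopSignal and worker_stops_promptly are hypothetical names, and the 5 second timeout is an arbitrary assumption chosen to make the early wake-up visible.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a manual-reset "stop" event.
class StopSignal {
public:
    void signal() {
        {
            std::lock_guard<std::mutex> lk(m_);
            stopped_ = true;
        }
        cv_.notify_all();
    }
    // Waits up to `timeout`; returns true if stop has been signalled.
    bool wait_for(std::chrono::milliseconds timeout) {
        assert(timeout.count() >= 0);
        std::unique_lock<std::mutex> lk(m_);
        return cv_.wait_for(lk, timeout, [this] { return stopped_; });
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool stopped_ = false;
};

// The worker waits with a long timeout in place of a sleep. Returns true
// if it exits well before that timeout once stop is signalled.
bool worker_stops_promptly() {
    StopSignal stop;
    std::thread worker([&stop] {
        while (!stop.wait_for(std::chrono::seconds(5))) {
            // periodic work would go here
        }
    });
    const auto start = std::chrono::steady_clock::now();
    stop.signal();
    worker.join();
    return std::chrono::steady_clock::now() - start < std::chrono::seconds(2);
}
```

The timed wait plays both roles: it paces the loop like a sleep would, and it returns as soon as the stop signal arrives.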

I'd restructure a bit:
WorkItem GetWorkItem()
{
    while(true)
    {
        // block until the queue signals that an item may be ready
        WaitForSingleObject(queue.Ready, INFINITE);
        {
            ScopeLock lock(queue.Lock);
            if(!queue.IsEmpty())
            {
                return queue.GetItem();
            }
        }
    }
}
int WorkerThread(param)
{
    bool done = false;
    do
    {
        WorkItem work = GetWorkItem();
        if( work.IsQuitMessage() )
        {
            done = true;
        }
        else
        {
            work.Process();
        }
    } while(!done);
    return 0;
}
Points of interest:
ScopeLock is an RAII class that makes critical section usage safer.
Block on the event until a work item is (possibly) ready, then lock while trying to dequeue it.
Don't use a global "IsDone" flag; enqueue special quit-message WorkItems instead.
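The quit-message idea can be sketched in portable C++ as follows. This is only an illustration under assumptions: it uses std::mutex and std::condition_variable instead of the Win32 event and critical section above, and WorkItem, WorkQueue and sum_with_worker are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical work item: either a number to add up or a quit marker.
struct WorkItem {
    int value;
    bool quit;
};

class WorkQueue {
public:
    void push(WorkItem item) {
        {
            std::lock_guard<std::mutex> lk(m_);
            items_.push(item);
        }
        ready_.notify_one();
    }
    WorkItem pop() {  // blocks until an item is available
        std::unique_lock<std::mutex> lk(m_);
        ready_.wait(lk, [this] { return !items_.empty(); });
        WorkItem item = items_.front();
        items_.pop();
        return item;
    }
private:
    std::mutex m_;
    std::condition_variable ready_;
    std::queue<WorkItem> items_;
};

// Feeds 1..n to a worker followed by a quit item; returns the worker's sum.
int sum_with_worker(int n) {
    assert(n >= 0);
    WorkQueue queue;
    int sum = 0;
    std::thread worker([&] {
        for (;;) {
            WorkItem work = queue.pop();
            if (work.quit) break;  // shutdown arrives in-band, no global flag
            sum += work.value;
        }
    });
    for (int i = 1; i <= n; ++i) queue.push(WorkItem{i, false});
    queue.push(WorkItem{0, true});
    worker.join();
    return sum;
}
```

Because the quit item travels through the same queue as the work, every item enqueued before it is processed before the worker exits.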

You can have a look at another approach here that uses C++0x atomic operations
http://www.drdobbs.com/high-performance-computing/210604448

Use a semaphore instead of an event.
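One reading of this suggestion is that a semaphore keeps a count, so several signals sent before the consumer waits are all remembered. The sketch below assumes that reading and builds a small counting semaphore from a mutex and a condition variable; Semaphore and count_wakeups are hypothetical names, and this is not the Win32 semaphore API.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical counting semaphore built from a mutex and condition variable.
class Semaphore {
public:
    explicit Semaphore(int initial = 0) : count_(initial) {
        assert(initial >= 0);
    }
    void release() {
        {
            std::lock_guard<std::mutex> lk(m_);
            ++count_;
        }
        cv_.notify_one();
    }
    void acquire() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return count_ > 0; });
        --count_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_;
};

// Releases the semaphore n times before the consumer starts, then counts
// how many times the consumer gets through acquire().
int count_wakeups(int n) {
    Semaphore items;
    for (int i = 0; i < n; ++i) items.release();  // all signals sent up front
    int wakeups = 0;
    std::thread consumer([&] {
        for (int i = 0; i < n; ++i) {
            items.acquire();
            ++wakeups;
        }
    });
    consumer.join();
    return wakeups;
}
```

In a work queue, release() would be called once per enqueued item and acquire() once per dequeue, so the count always matches the number of pending items.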

Keep the signaling and synchronizing separate. Something along these lines...
// in main thread
HANDLE events[2];
events[0] = CreateEvent(...); // for shutdown
events[1] = CreateEvent(...); // for work to do
// start thread and pass the events
// in worker thread
DWORD ret;
while (true)
{
    ret = WaitForMultipleObjects(2, events, FALSE, <timeout val or INFINITE>);
    if shutdown            // ret == WAIT_OBJECT_0
        return
    else if do-work        // ret == WAIT_OBJECT_0 + 1
        enter crit sec
        dequeue work
        leave crit sec
        etc.
    else if timeout        // ret == WAIT_TIMEOUT
        do something else that has to be done
}

Given that this question is tagged windows, I'll answer thus:
Don't create just one worker thread. Your worker thread jobs are presumably independent, so you can process multiple jobs at once. If so:
In your main thread, call CreateIoCompletionPort to create an I/O completion port object.
Create a pool of worker threads. The number you need depends on how many jobs you want to service in parallel; some multiple of the number of CPU cores is a good start.
Each time a job comes in, call PostQueuedCompletionStatus(), passing a pointer to the job struct as the lpOverlapped parameter.
Each worker thread calls GetQueuedCompletionStatus(), retrieves the work item from the lpOverlapped pointer, and does the job before calling GetQueuedCompletionStatus() again.
This looks heavy, but I/O completion ports are implemented in kernel mode and represent a queue that can be dequeued by any of the worker threads associated with it (i.e. waiting on a call to GetQueuedCompletionStatus). The I/O completion port knows how many of the threads that are processing an item are actually using a CPU versus blocked on an I/O call, and will release more worker threads from the pool to ensure that the concurrency count is met.
So, it's not lightweight, but it is very efficient. An I/O completion port can be associated with pipe and socket handles, for example, and can dequeue the results of asynchronous operations on those handles. I/O completion port designs can scale to handling tens of thousands of socket connections on a single server, but on the desktop they are also a very convenient way of scaling job processing over the 2 or 4 cores now common in desktop PCs.

Modelling boost::Lockable with semaphore rather than mutex (previously titled: Unlocking a mutex from a different thread)

I'm using the C++ boost::thread library, which in my case means I'm using pthreads. Officially, a mutex must be unlocked from the same thread which locks it, and I want the effect of being able to lock in one thread and then unlock in another. There are many ways to accomplish this. One possibility would be to write a new mutex class which allows this behavior.
For example:
class inter_thread_mutex{
    bool locked;
    boost::mutex mx;
    boost::condition_variable cv;
public:
    inter_thread_mutex() : locked(false) {} // start out unlocked
    void lock(){
        boost::unique_lock<boost::mutex> lck(mx);
        while(locked) cv.wait(lck);
        locked=true;
    }
    void unlock(){
        {
            boost::lock_guard<boost::mutex> lck(mx);
            if(!locked) error();
            locked=false;
        }
        cv.notify_one();
    }
    // bool try_lock(); void error(); etc.
};
I should point out that the above code doesn't guarantee FIFO access, since if one thread calls lock() while another calls unlock(), this first thread may acquire the lock ahead of other threads which are waiting. (Come to think of it, the boost::thread documentation doesn't appear to make any explicit scheduling guarantees for either mutexes or condition variables). But let's just ignore that (and any other bugs) for now.
My question is, if I decide to go this route, would I be able to use such a mutex as a model of the Boost Lockable concept? For example, would anything go wrong if I use a boost::unique_lock< inter_thread_mutex > for RAII-style access, and then pass this lock to boost::condition_variable_any.wait(), etc.?
On one hand I don't see why not. On the other hand, "I don't see why not" is usually a very bad way of determining whether something will work.
The reason I ask is that if it turns out that I have to write wrapper classes for RAII locks and condition variables and whatever else, then I'd rather just find some other way to achieve the same effect.
EDIT:
The kind of behavior I want is basically as follows. I have an object, and it needs to be locked whenever it is modified. I want to lock the object from one thread, and do some work on it. Then I want to keep the object locked while I tell another worker thread to complete the work. So the first thread can go on and do something else while the worker thread finishes up. When the worker thread gets done, it unlocks the mutex.
And I want the transition to be seamless so nobody else can get the mutex lock in between when thread 1 starts the work and thread 2 completes it.
Something like inter_thread_mutex seems like it would work, and it would also allow the program to interact with it as if it were an ordinary mutex. So it seems like a clean solution. If there's a better solution, I'd be happy to hear that also.
EDIT AGAIN:
The reason I need locks to begin with is that there are multiple master threads, and the locks are there to prevent them from accessing shared objects concurrently in invalid ways.
So the code already uses loop-level lock-free sequencing of operations at the master thread level. Also, in the original implementation, there were no worker threads, and the mutexes were ordinary kosher mutexes.
The inter_thread_thingy came up as an optimization, primarily to improve response time. In many cases, it was sufficient to guarantee that the "first part" of operation A occurs before the "first part" of operation B. As a dumb example, say I punch object 1 and give it a black eye. Then I tell object 1 to change its internal structure to reflect all the tissue damage. I don't want to wait around for the tissue damage before I move on to punch object 2. However, I do want the tissue damage to occur as part of the same operation; for example, in the interim, I don't want any other thread to reconfigure the object in such a way that would make tissue damage an invalid operation. (Yes, this example is imperfect in many ways, and no, I'm not working on a game.)
So we made the change to a model where ownership of an object can be passed to a worker thread to complete an operation, and it actually works quite nicely; each master thread is able to get a lot more operations done because it doesn't need to wait for them all to complete. And, since the event sequencing at the master thread level is still loop-based, it is easy to write high-level master-thread operations, as they can be based on the assumption that an operation is complete (more precisely, the critical "first part" upon which the sequencing logic depends is complete) when the corresponding function call returns.
Finally, I thought it would be nice to use inter_thread mutex/semaphore thingies using RAII with boost locks to encapsulate the necessary synchronization that is required to make the whole thing work.

man pthread_mutex_unlock (this is on OS X, similar wording on Linux) has the answer:
NAME
    pthread_mutex_unlock -- unlock a mutex
SYNOPSIS
    #include <pthread.h>
    int pthread_mutex_unlock(pthread_mutex_t *mutex);
DESCRIPTION
    If the current thread holds the lock on mutex, then the
    pthread_mutex_unlock() function unlocks mutex.
    Calling pthread_mutex_unlock() with a mutex that the
    calling thread does not hold will result in
    undefined behavior.
...
My counter-question would be - what kind of synchronization problem are you trying to solve with this? Most probably there is an easier solution.
Neither pthreads nor boost::thread (built on top of it) guarantees any order in which a contended mutex is acquired by competing threads.

Sorry, but I don't understand. What will be the state of your mutex at line [1] in the following code if another thread can unlock it?
inter_thread_mutex m;
{
    m.lock();
    // [1]
    m.unlock();
}
This makes no sense.

There are a few ways to approach this. Both of the ones I'm going to suggest involve adding an additional piece of information to the object, rather than adding a mechanism to unlock a mutex from a thread other than the one that owns it.
1) you can add some information to indicate the object's state:
enum modification_state {
    consistent,      // ready to be examined or to start being modified
    phase1_complete, // ready for the second thread to finish the work
};
// first worker thread
lock();
do_init_work(object);
object.mod_state = phase1_complete;
unlock();
signal();
do_other_stuff();
// second worker thread
lock();
while( object.mod_state != phase1_complete )
    wait();
do_final_work(object);
object.mod_state = consistent;
unlock();
signal();
// some other thread that needs to read the data
lock();
while( object.mod_state != consistent )
    wait();
read_data(object);
unlock();
Works just fine with condition variables, because obviously you're not writing your own lock.
2) If you have a specific thread in mind, you can give the object an owner.
// first worker
lock();
while( obj.owner != this_thread() )
    wait();
do_initial_work(obj);
obj.owner = second_thread_id;
unlock();
signal();
...
This is pretty much the same solution as my first solution, but more flexible in the adding/removing of phases, and less flexible in the adding/removing of threads.
To be honest, I'm not sure how inter thread mutex would help you here. You'd still need a semaphore or condition variable to signal the passing of the work to the second thread.
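The first approach (a state flag guarded by an ordinary mutex and condition variable) could look roughly like this in standard C++. It is a sketch under assumptions: SharedObject, ModState and run_two_phase_update are hypothetical names, the "work" is just string appends, and the first phase runs on the calling thread so the outcome is deterministic.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>

enum class ModState { consistent, phase1_complete };

// Hypothetical shared object; `state` and `data` are guarded by `m`.
struct SharedObject {
    std::mutex m;
    std::condition_variable cv;
    ModState state = ModState::consistent;
    std::string data;
};

std::string run_two_phase_update() {
    SharedObject obj;
    std::string seen;

    // First thread (here: the caller) does the initial work and hands off.
    {
        std::lock_guard<std::mutex> lk(obj.m);
        obj.data = "phase1";
        obj.state = ModState::phase1_complete;
    }
    obj.cv.notify_all();

    // Second worker finishes the job once phase 1 is marked complete.
    std::thread finisher([&] {
        std::unique_lock<std::mutex> lk(obj.m);
        obj.cv.wait(lk, [&] { return obj.state == ModState::phase1_complete; });
        obj.data += "+phase2";
        obj.state = ModState::consistent;
        lk.unlock();
        obj.cv.notify_all();
    });

    // A reader only proceeds when the object is consistent again.
    std::thread reader([&] {
        std::unique_lock<std::mutex> lk(obj.m);
        obj.cv.wait(lk, [&] { return obj.state == ModState::consistent; });
        seen = obj.data;
    });

    finisher.join();
    reader.join();
    assert(obj.state == ModState::consistent);
    return seen;
}
```

The mutex is never held across the hand-off; it is the state flag that keeps the reader out until the second phase is done, so every lock and unlock happens on the thread that took it.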

Small modification to what you already have: how about storing the id of the thread which you want to take the lock in your inter_thread_whatever? Then unlock it, and send a message to that thread, saying "I want you to execute whatever routine it is that tries to take this lock".
Then the condition in lock becomes while(locked || (desired_locker != thisthread && desired_locker != 0)). Technically you've "released the lock" in the first thread, and "taken it again" in the second thread, but there's no way that any other thread can grab it in between, so it's as if you've transferred it directly from one to the other.
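A rough sketch of that condition in standard C++, offered with the same caveats as the answer itself. HandoffLock, unlock_to and handoff_order are hypothetical names; a default-constructed std::thread::id is assumed to play the role of the "0" value, and the reservation is cleared once the desired thread has taken the lock.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical lock that can be released "to" one specific thread.
class HandoffLock {
public:
    void lock() {
        const std::thread::id me = std::this_thread::get_id();
        std::unique_lock<std::mutex> lk(m_);
        // Proceed only if the lock is free and either nobody has a
        // reservation or the reservation is for this thread.
        cv_.wait(lk, [&] {
            return !locked_ &&
                   (desired_ == std::thread::id() || desired_ == me);
        });
        locked_ = true;
        desired_ = std::thread::id();  // reservation consumed
    }
    void unlock() { unlock_to(std::thread::id()); }
    // Release the lock but reserve it for thread `next`.
    void unlock_to(std::thread::id next) {
        {
            std::lock_guard<std::mutex> lk(m_);
            assert(locked_);
            locked_ = false;
            desired_ = next;
        }
        cv_.notify_all();  // the reserved thread must be among those woken
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool locked_ = false;
    std::thread::id desired_;
};

// The caller takes the lock, a rival and a helper both try to take it,
// and the caller hands it to the helper. Returns the acquisition order.
std::vector<int> handoff_order() {
    HandoffLock lock;
    std::vector<int> order;  // only touched while holding `lock`
    lock.lock();
    std::thread rival([&] { lock.lock(); order.push_back(2); lock.unlock(); });
    std::thread helper([&] { lock.lock(); order.push_back(1); lock.unlock(); });
    lock.unlock_to(helper.get_id());  // pass the lock to the helper
    helper.join();
    rival.join();
    return order;
}
```

The rival cannot slip in between the release and the helper's acquisition, because the reservation is written under the same internal mutex that clears the locked flag.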
There's a potential problem, that if a thread exits or is killed, while it's the desired locker of your lock, then that thread deadlocks. But you were already talking about the first thread waiting for a message from the second thread to say that it has successfully acquired the lock, so presumably you already have a plan in mind for what happens if that message is never received. To that plan, add "reset the desired_locker field on the inter_thread_whatever".
This is all very hairy, though, I'm not convinced that what I've proposed is correct. Is there a way that the "master" thread (the one that's directing all these helpers) can just make sure that it doesn't order any more operations to be performed on whatever is protected by this lock, until the first op is completed (or fails and some RAII thing notifies you)? You don't need locks as such, if you can deal with it at the level of the message loop.

I don't think it is a good idea to say that your inter_thread_mutex (a binary semaphore) can be seen as a model of Lockable. The main issue is that the main feature of your inter_thread_mutex defeats the Lockable concept. If inter_thread_mutex were a model of Lockable, you would expect that at [1] the inter_thread_mutex m is locked.
// thread T1
inter_thread_mutex m;
{
    unique_lock<inter_thread_mutex> lk(m);
    // [1]
}
But as another thread T2 can call m.unlock() while T1 is at [1], the guarantee is broken.
Binary semaphores can be used as Lockables as long as each thread locks before unlocking. But the main goal of your class is exactly the contrary.
This is one of the reasons semaphores in Boost.Interprocess don't use lock/unlock to name their functions, but wait/notify. Curiously, these are the same names used by conditions :)

A mutex is a mechanism for describing mutually exclusive blocks of code. It does not make sense for these blocks of code to cross thread boundaries. Trying to use such a concept in such a counterintuitive way can only lead to problems down the line.
It sounds very much like you're looking for a different multi-threading concept, but without more detail it's hard to know what.
