Elegant ways to notify consumer when producer is done? - c++

I'm implementing a concurrent_blocking_queue with minimal functions:
//a thin wrapper over std::queue
template<typename T>
class concurrent_blocking_queue
{
    std::queue<T> m_internal_queue;
    //...
public:
    void add(T const & item);
    T& remove();
    bool empty();
};
I intend to use this for the producer-consumer problem (I guess that is where one uses such data structures?). But I'm stuck on one problem:
How do I elegantly notify the consumer when the producer is done? How would the producer notify the queue when it is done? By calling a specific member function, say done()? Is throwing an exception from the queue (i.e. from the remove function) a good idea?
I came across many examples, but all have infinite loops, as if the producer will produce items forever. None discussed the stopping condition, not even the wiki article.

I've simply introduced a dummy "done" product in the past. So if the producer can create "products" of, say, type A and type B, I've invented type "done". When a consumer encounters a product of type "done" it knows that further processing isn't required anymore.
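For illustration, a minimal sketch of that idea against the questioner's queue shape, assuming remove() returns by value and the element type is std::optional (C++17); an empty optional plays the role of the "done" product:

#include <optional>

void process(int value);   // stand-in for real work

// 'queue' is assumed to be the thread-safe blocking queue from the question,
// instantiated with std::optional<int> as the element type.
void producer(concurrent_blocking_queue<std::optional<int>>& queue)
{
    for (int i = 0; i < 100; ++i)
        queue.add(i);             // real products
    queue.add(std::nullopt);      // the in-band "done" product
}

void consumer(concurrent_blocking_queue<std::optional<int>>& queue)
{
    while (auto item = queue.remove())   // empty optional => no more work
        process(*item);
}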

It is true that it's common to enqueue a special "we're done" message; however I think OP's original desire for an out-of-band indicator is reasonable. Look at the complexity people are contemplating to set up an in-band completion message! Proxy types, templating; good grief. I'd say a done() method is simpler and easier, and it makes the common case (we're not done yet) faster and cleaner.
I would agree with kids_fox that a try_remove that returns an error code if the queue is done is preferred, but that's stylistic and YMMV.
Edit:
Bonus points for implementing a queue that keeps track of how many producers are remaining in a multiple-producers situation and raises the done exception iff all producers have thrown in the towel ;-) Not going to do that with in-band messages!
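A hedged sketch of that bonus exercise, under invented names (multi_producer_queue and queue_done are hypothetical): the queue counts registered producers and throws only once the last one has bowed out and the queue is drained. Producers are assumed to register before consumers start removing.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <stdexcept>

struct queue_done : std::runtime_error {
    queue_done() : std::runtime_error("all producers done") {}
};

template <typename T>
class multi_producer_queue {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
    int producers_ = 0;
public:
    void add_producer()
    {
        std::lock_guard<std::mutex> l(m_);
        ++producers_;
    }
    void producer_done()
    {
        std::lock_guard<std::mutex> l(m_);
        if (--producers_ == 0)
            cv_.notify_all();   // wake everyone so they can observe "done"
    }
    void add(T item)
    {
        { std::lock_guard<std::mutex> l(m_); q_.push(std::move(item)); }
        cv_.notify_one();
    }
    // Throws queue_done iff the queue is drained and no producers remain.
    T remove()
    {
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [this]{ return !q_.empty() || producers_ == 0; });
        if (q_.empty())
            throw queue_done();
        T item = std::move(q_.front());
        q_.pop();
        return item;
    }
};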

My queues have usually used pointers (with an std::auto_ptr in the interface, to clearly indicate that the sender may no longer access the pointer); for the most part, the queued objects were polymorphic, so dynamic allocation and reference semantics were required anyway.
Otherwise, it shouldn't be too difficult to add an “end of file” flag to the queue. You'd need a special function on the producer side (close?) to set it (using exactly the same locking primitives as when you write to the queue), and the loop in the removal function must wait for either something to be there, or the queue to be closed. Of course, you'll need to return a Fallible value, so that the reader can know whether the read succeeded or not. Also, don't forget that in this case, you need a notify_all to ensure that all processes waiting on the condition are awoken.
BTW: I don't quite see how your interface is implementable. What does the T& returned by remove refer to? Basically, remove has to be something like:
Fallible<T>
MessageQueue<T>::receive()
{
    ScopedLock l( myMutex );
    while ( myQueue.empty() && ! myIsDone )
        myCondition.wait( l );               // wait on the lock, which releases the mutex
    Fallible<T> results;
    if ( !myQueue.empty() ) {
        results.validate( myQueue.front() ); // std::queue exposes front(), not top()
        myQueue.pop();
    }
    return results;
}
Even without the myIsDone condition, you have to read the value into a
local variable before removing it from the queue, and you can't return a
reference to a local variable.
For the rest:
void
MessageQueue<T>::send( T const& newValue )
{
    ScopedLock l( myMutex );
    myQueue.push( newValue );
    myCondition.notify_all();
}

void
MessageQueue<T>::close()
{
    ScopedLock l( myMutex );
    myIsDone = true;
    myCondition.notify_all();
}
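Fallible is not a standard type; for readers without it, here is a sketch of the same design using std::optional (C++17) and the standard primitives:

#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>

template <typename T>
class MessageQueue {
    std::queue<T> myQueue;
    std::mutex myMutex;
    std::condition_variable myCondition;
    bool myIsDone = false;
public:
    std::optional<T> receive()
    {
        std::unique_lock<std::mutex> l(myMutex);
        myCondition.wait(l, [this]{ return !myQueue.empty() || myIsDone; });
        std::optional<T> result;
        if (!myQueue.empty()) {
            result = std::move(myQueue.front());
            myQueue.pop();
        }
        return result;   // empty optional <=> queue closed and drained
    }
    void send(T const& newValue)
    {
        { std::lock_guard<std::mutex> l(myMutex); myQueue.push(newValue); }
        myCondition.notify_all();
    }
    void close()
    {
        { std::lock_guard<std::mutex> l(myMutex); myIsDone = true; }
        myCondition.notify_all();
    }
};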

'Stopping' is not often discussed because it's often never done. In those cases where it is required, it's often just as easy, and more flexible, to enqueue a poison-pill using the higher-level P-C protocol itself as it is to build extra functionality into the queue itself.
If you really want to do this, you could indeed set a flag that causes every consumer to raise an exception, either 'immediately' or whenever it gets back to the queue, but there are problems. Do you need the 'done' method to be synchronous, i.e. do you want all the consumers gone by the time 'done' returns, or asynchronous, i.e. the last consumer thread calls an event parameter when all the other consumers are gone?
How are you going to arrange for those consumers that are currently waiting to wake up? How many are waiting and how many are busy, but will return to the queue when they have done their work? What if one or more consumers are stuck on a blocking call, (perhaps they can be unblocked, but that requires a call from another thread - how are you going to do that)?
How are the consumers going to notify that they have handled their exception and are about to die? Is 'about to die' enough, or do you need to wait on the thread handle? If you have to wait on the thread handle, what is going to do the waiting - the thread requesting the queue shutdown or the last consumer thread to notify?
Oh yes - to be safe, you should arrange for producer threads that turn up with objects to queue while in the 'shutting down' state to raise an exception as well.
I raise these questions because I've done all this once, a long time ago. Eventually, it all worked-ish. The objects queued up all had to have a 'QueuedItem' inserted into their inheritance chain, (so that a job-cancellation method could be exposed to the queue), and the queue had to keep a thread-safe list of objects that had been popped off by threads but not processed yet.
After a while, I stopped using the class in favour of a simple P-C queue with no special shutdown mechanism.

Related

How does a mutex/condition variable signaling loop work?

I will make a hypothetical scenario just to be clear about what I need to know.
Let's say I have a single file being updated very often.
I need to read and parse this file by several different threads.
Every time this file is rewritten, I'm going to signal a condition variable so the other threads can do whatever they want to.
My question is:
If I have 10000 threads, the first thread execution will block the execution of the other 9999 ones?
Does it work in parallel or synchronously?
This post has been edited since first posted to address comments below by Jonathan Wakely, and to better distinguish between a condition_variable, a condition (which were both called condition in the first version), and how the wait function operates. Just as important, however, is an exploration of better methods from modern C++, using std::future, std::thread and std::packaged_task, with some discussion regarding buffering and reasonable thread count.
First, 10,000 threads is a lot of threads. The thread scheduler will be highly burdened on all but the very highest performance of computers. Typical quad core workstations under Windows would struggle. It's a sign that some kind of queued scheduling of tasks is in order, typical of servers accepting thousands of connections using perhaps 10 threads, each servicing 1,000 connections. The number of threads is really not important to the question; the point is that for such a volume of tasks, 10,000 threads is impracticable.
To handle synchronization, the mutex doesn't actually do what you're proposing, by itself. The concept you're describing is a type of event object, perhaps an auto reset event, which by itself is a higher level concept. Windows has them as part of its API, but they are fashioned on Linux (and for portable software, usually) with two primitive components, a mutex and a condition variable. Together these create the auto reset event, and other types of "waitable events" as Windows calls them. In C++ these are provided by std::mutex and std::condition_variable.
Mutexes by themselves merely provide locked control over a common resource. In that scenario we are not thinking in terms of clients and a server (or workers and an executive), but we're thinking in terms of competition among peers for a single resource which can only be accessed by one actor (thread) at a time. A mutex can block execution, but it does not release based on an external signal. Mutexes block if another thread has locked the mutex, and wait indefinitely until the owner of the lock releases it. This isn't the scenario you present in the question.
In your scenario, there are many "clients" and one "server" thread. The server is in charge of signalling that something is ready to be processed. All other threads are clients in this design (nothing about the thread itself makes them clients, we merely deem them so by the function they execute). In some discussions, clients are called worker threads.
The clients use a mutex/condition variable pair to wait for a signal. This construct usually takes the form of locking a mutex, then waiting on the condition variable using that mutex. When a thread enters wait on the condition variable, the mutex is unlocked. This is repeated for all client threads who wait for work to be done. A typical client wait example is:
std::mutex m;
std::condition_variable cv;

void client_thread()
{
    // Wait until server signals data is ready
    std::unique_lock<std::mutex> lk(m); // lock the mutex
    cv.wait(lk);                        // wait on cv
    // do the work
}
This is pseudo code showing the mutex/condition variable used together. std::condition_variable has two overloads of the wait function; this is the simplest one. The intent is that a thread will block, entering into an idle state until the condition_variable is signalled. It is not intended as a complete example, merely to point out these two objects are used together.
Jonathan Wakely's comments below are based on the fact that wait is not indefinite; there is no guarantee that the reason the call is unblocked is because of a signal. The documentation calls this a "spurious wakeup", which occasionally occurs for complex reasons of OS scheduling. The point which Jonathan makes is that code using this pair must be safe to operate even if the wakeup is not because the condition_variable was signalled.
In the parlance of using condition variables, this is known as a condition (not the condition_variable). The condition is an application defined concept, usually illustrated as a boolean in the literature, and often the result of checking a bool, an integer (sometimes of atomic type) or calling a function returning a bool. Sometimes application defined notions of what constitutes a true condition are more complex, but the overall effect of the condition is to determine whether or not the thread, once awakened, should continue to process, or should simply repeat the wait.
One way to satisfy this requirement is the second version of std::condition_variable::wait. The two are declared:
void wait( std::unique_lock<std::mutex>& lock );

template< class Predicate >
void wait( std::unique_lock<std::mutex>& lock, Predicate pred );
Jonathan's point is to insist the second version be used. However, documentation describes (and the fact there are two overloads indicates) that the Predicate is optional. The Predicate is a functor of some kind, often a lambda expression, resolving to true if the wait should unblock, false if the wait should continue waiting, and it is evaluated under lock. The Predicate is synonymous with condition in that the Predicate is one way in which to indicate true or false regarding whether wait should unblock.
Although the Predicate is, in fact, optional, the notion that 'wait' is not perfect in blocking until a signal is received requires that if the first version is used, it is because the application is constructed such that spurious wakes have no consequence (indeed, are part of the design).
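For completeness, a small example of the second overload, where the predicate expresses the condition (the 'ready' flag is an assumed application-defined condition):

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool ready = false;   // the application-defined condition

void client_thread()
{
    std::unique_lock<std::mutex> lk(m);
    // The predicate is re-checked on every wakeup, so spurious wakeups are harmless.
    cv.wait(lk, []{ return ready; });
    // do the work
}

void server_thread()
{
    { std::lock_guard<std::mutex> lk(m); ready = true; }
    cv.notify_all();
}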
Jonathan's citation shows that the Predicate is evaluated under lock, but in generalized forms of the paradigm that's frequently not practicable. std::condition_variable must wait on a locked std::mutex, which may be protecting a variable defining the condition, but sometimes that's not possible. Sometimes the condition is more complex, external, or trivial enough that the std::mutex isn't associated with the condition.
To see how that works in the context of the proposed solution, assume there are 10 client threads waiting for a server to signal that work is to be done, and that work is scheduled in a queue as a container of virtual functors. A virtual functor might be something like:
struct VFunc
{
    virtual ~VFunc() {}
    virtual void operator()() {}
};

template <typename T>
struct VFunctor : VFunc   // derives from VFunc so Queue can hold it polymorphically
{
    // Something referring to T, possibly std::function
    virtual void operator()() { /* ...call the std::function... */ }
};

typedef std::deque< VFunc* > Queue;   // pointers, so derived functors aren't sliced
The pseudo code above suggests a typical functor with a virtual operator(), returning void and taking no parameters, sometimes known as a "blind call". The key point in suggesting it is the fact Queue can own a collection of these (through pointers, so the derived type is preserved) without knowing what is being called, and whatever VFunctors are in Queue could refer to anything std::function might be able to call, which includes member functions of other objects, lambdas, simple functions, etc. If, however, there is only one function signature to be called, perhaps:

typedef std::deque< std::function<void(void)> > Queue;

is sufficient.
For either case, work is to be done only if there are entries in Queue.
To wait, one might use a class like:
class AutoResetEvent
{
private:
    std::mutex m;
    std::condition_variable cv;
    bool signalled;
    bool signalled_all;
    unsigned int wcount;

public:
    AutoResetEvent() : signalled(false), signalled_all(false), wcount(0) {}

    void SignalAll()
    {
        std::unique_lock<std::mutex> l(m);
        signalled = true;
        signalled_all = true;
        cv.notify_all();
    }

    void SignalOne()
    {
        std::unique_lock<std::mutex> l(m);
        signalled = true;
        cv.notify_one();
    }

    void Wait()
    {
        std::unique_lock<std::mutex> l(m);
        ++wcount;
        while ( !signalled )
        {
            cv.wait(l);
        }
        --wcount;
        if ( signalled_all )
        {
            if ( wcount == 0 )
            {
                signalled = false;
                signalled_all = false;
            }
        }
        else
        {
            signalled = false;
        }
    }
};
This is pseudo code of a standard reset-event type of waitable object, compatible with the Windows CreateEvent and WaitForSingleObject API, functioning in basically the same way.
All client threads end up at cv.wait (this can have a timeout in Windows, using the Windows API, but not with std::condition_variable). At some point, the server signals the event with a call to Signalxxx. Your scenario suggests SignalAll().
If notify_one is called, one of the waiting threads is released, and all others remain asleep. If notify_all is called, then all threads waiting on that condition are released to do work.
The following might be an example of using AutoResetEvent:
AutoResetEvent evt; // probably not a global

void client()
{
    while( !Shutdown ) // assuming some bool to indicate shutdown
    {
        if ( IsWorkPending() ) DoWork();
        evt.Wait();
    }
}

void server()
{
    // gather data
    evt.SignalAll();
}
The use of IsWorkPending() satisfies the notion of a condition, as Jonathan Wakely indicates. Until a shutdown is indicated, this loop will process work if it's pending, and wait for a signal otherwise. Spurious wakeups have no negative effect. IsWorkPending() would check Queue.size(), possibly through an object which protects Queue with a std::mutex or some other synchronization mechanism. If work is pending, DoWork() would sequentially pop entries out of Queue until Queue is empty. Upon return, the loop would again wait for a signal.
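For concreteness, one plausible shape for IsWorkPending() and DoWork() over the std::function flavour of Queue above; the mutex guarding Queue is an assumption of this sketch:

#include <deque>
#include <functional>
#include <mutex>

typedef std::deque< std::function<void(void)> > Queue;

std::mutex queueMutex;   // assumed: Queue is guarded by its own mutex
Queue workQueue;

bool IsWorkPending()
{
    std::lock_guard<std::mutex> l(queueMutex);
    return !workQueue.empty();
}

void DoWork()
{
    for (;;) {
        std::function<void(void)> task;
        {
            std::lock_guard<std::mutex> l(queueMutex);
            if (workQueue.empty())
                return;                       // drained; go back to waiting
            task = std::move(workQueue.front());
            workQueue.pop_front();
        }
        task();                               // run the item outside the lock
    }
}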
With all of that discussed, the combination of mutex and condition_variable is related to an old style of thinking, now outdated in the era of C++11/C++14. Unless you have trouble using a compliant compiler, it would be better to investigate the use of std::promise, std::future and either std::async or std::thread with std::packaged_task. For example, using future, promise, packaged_task and thread could entirely replace the discussion above.
For example:
// a function for threads to execute
int func()
{
    int result = 0;
    // do some work, set status as result
    return result;
}
Assuming func does the work you require on the files, these typedefs apply:
typedef std::packaged_task< int() > func_task;
typedef std::future< int > f_int;
typedef std::shared_ptr< f_int > f_int_ptr;
typedef std::vector< f_int_ptr > f_int_vec;
std::future can't be copied, so it's stored using a shared_ptr for ease of use in a vector, but there are various solutions.
Next, an example of using these for 10 threads of work:

void executive_function()
{
    // a vector of future pointers
    f_int_vec future_list;

    // start some threads
    for( int n = 0; n < 10; ++n )
    {
        // a packaged_task calling func
        func_task ft( &func );

        // get a future from the task, held through a shared_ptr
        f_int_ptr future_ptr( new f_int( ft.get_future() ) );

        // store the future for later use
        future_list.push_back( future_ptr );

        // launch a thread to call the task
        std::thread( std::move( ft ) ).detach();
    }

    // at this point, 10 threads are running

    for( auto &d : future_list )
    {
        // for each future pointer, wait (block if required)
        // for the corresponding thread's func to return
        d->wait();

        // get the result of the func return value
        int res = d->get();
    }
}
The point here is really in the last range-for loop. The vector stores futures, which the packaged_tasks provided. Those tasks are used to launch threads, and the future is key to synchronizing the executive. Once all threads are running, each is "waited on" with a simple call to the future's wait function, after which the return value of func can be obtained. No mutexes or condition_variables involved (that we know of).
This brings me to the subject of processing files in parallel, no matter how you launch a number of threads. If there were a machine which could handle 10,000 threads, then if each thread were a trivial file oriented operation there would be considerable RAM resources devoted to file processing, all duplicating each other. Depending on the API chosen, there are buffers associated with each read operation.
Let's say the file was 10 Mbytes, and 10,000 threads began operating on it, where each thread used 4 Kbyte buffers for processing. Combined, that suggests there would be 40 Mbytes of buffers to process a 10 Mbyte file. It would be less wasteful to simply read the file into RAM, and offer read only access to all threads from RAM.
That notion is further complicated by the fact that multiple tasks reading from various sections of the file at different times may cause heavy thrashing from a standard hard disk (not so for flash sources), if the disk cache can't keep up. More important, though, is that 10,000 threads are all calling system APIs for reading the file, each with considerable overhead.
If the source material is a candidate for reading entirely into RAM, the threads could be focused on RAM instead of the file, alleviating that overhead, improving performance. The threads could share read access to the contents without locks.
If the source file is too large to read entirely into RAM, it may still be best read in blocks of the source file, have threads process that portion from a shared memory resource, then move to the next block in a series.
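A minimal sketch of that suggestion, reading the file into RAM once and letting threads share read-only access (the file name and parse_section are invented for illustration):

#include <fstream>
#include <iterator>
#include <string>
#include <thread>
#include <vector>

// Hypothetical per-thread parser; the real work would go here.
void parse_section(const std::string& bytes, int section) { /* ... */ }

int main()
{
    std::ifstream in("data.bin", std::ios::binary);   // hypothetical file name
    std::string contents((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());

    std::vector<std::thread> workers;
    for (int n = 0; n < 10; ++n)
        workers.emplace_back([&contents, n] { parse_section(contents, n); });
    for (auto& t : workers)
        t.join();   // read-only access to 'contents' needed no locking
}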

C++ std condition variable covering a lot of share variables

I am writing a multithreaded program in C++, and in my main thread I am waiting for my other threads to put packages in different queues, depending on the sort of package and the thread from which they originate.
The queues are protected by mutexes, as they should be.
But in my main I don't want to be doing:
while(true)
{
    if(!queue1->empty())
    {
        // do stuff
    }
    if(!queue2->empty())
    {
        // do stuff
    }
    // etc.
}
So you need to use condition variables to signal the main thread that something has changed. Now I can only block on one condition variable, so I need all these threads to use the same condition variable and an accompanying mutex. But I don't really want that mutex to lock all my threads: when one thread is writing to a queue, another should still be able to write to a totally different queue. So I use separate mutexes for each queue. But now how do I use the additional mutex that comes with the condition variable?
Here is how it's done with 2 threads and 1 queue using Boost, very similar to std:
http://www.justsoftwaresolutions.co.uk/threading/implementing-a-thread-safe-queue-using-condition-variables.html :
template<typename Data>
class concurrent_queue
{
private:
    std::queue<Data> the_queue;    // members implied by "rest as before" in the article
    boost::mutex the_mutex;
    boost::condition_variable the_condition_variable;
public:
    void wait_for_data()
    {
        boost::mutex::scoped_lock lock(the_mutex);
        while(the_queue.empty())
        {
            the_condition_variable.wait(lock);
        }
    }

    void push(Data const& data)
    {
        boost::mutex::scoped_lock lock(the_mutex);
        bool const was_empty = the_queue.empty();
        the_queue.push(data);
        if(was_empty)
        {
            the_condition_variable.notify_one();
        }
    }

    // rest as before
};
So how do you solve this?
I'd say the key to your problem is here:
Now I don't really want that mutex to lock all my threads. It doesn't mean that when one thread is writing to a queue, another can't write to a totally different queue. So I use separate mutexes for each queue.
Why? Because:
... packages come in relatively slow. And queues are empty most of the time
It seems to me that you've designed yourself into a corner: you added something you thought you needed when in reality you probably didn't, because one queue would have actually worked in the usage scenario you mention.
I'd say start off with one queue and see how far it gets you. Then, when you run into a limitation where you really do have many threads waiting on a single mutex, you will have a lot more information about the problem and will therefore be able to solve it better.
In essence, I'd say the reason you're facing this problem is premature design optimization and the way to fix that would be to back track and change the design right now.
Create a top-level (possibly circular) queue of all the queues that have work in them.
This queue can be protected by a single mutex, and have a condvar which only needs to be signalled when it changes from empty to non-empty.
Now, your individual queues can each have their own mutex, and they only need to touch the shared/top-level queue (and its mutex) when they change from empty to non-empty.
Some details will depend on whether you want your thread to take only the front item from each non-empty queue in turn, or consume each whole queue in sequence, but the idea is there.
Going from non-empty to non-empty (but increased size) should also be passed on to the top level queue?
That depends, as I said, on how you consume them. If, every time a queue has something in it, you do this:
(you already have the top-level lock, that's how you know this queue has something in it)
lock the queue
swap the queue contents with a local working copy
remove the queue from the top-level queue
unlock the queue
then a work queue is always either non-empty, and hence in the top-level queue, or empty and hence not in the queue.
If you don't do this, and just pull the front element off each non-empty queue, then you have more states to consider.
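In code, that lock/swap/remove consume step might look something like this sketch (WorkQueue is an invented per-producer queue type; the caller is assumed to already hold the top-level lock, as described in the steps above):

#include <deque>
#include <mutex>

template <typename T>
struct WorkQueue {
    std::deque<T> items;
    std::mutex lock;
};

template <typename T>
std::deque<T> drain(WorkQueue<T>& q)
{
    std::deque<T> local;
    {
        std::lock_guard<std::mutex> l(q.lock);
        local.swap(q.items);   // take the whole contents at once
        // (the caller removes q from the top-level queue here, while q is empty)
    }
    return local;              // process the items without holding any lock
}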
Note that if
... packages come in relatively slow. And queues are empty most of the time
you could probably just have one queue, since there wouldn't be enough activity to cause a lot of contention. This simplifies things enormously.
Both @Carleeto and @Useless have given good answers. You have only a single consumer, so a single queue will give you the best performance. You can't get higher throughput than the single consumer working constantly, so your objective should be to minimize the locking overhead of the single consumer, not the producers. You do this by having the consumer wait on a single condition variable (indicating that the queue is non-empty) with a single mutex.
Here's how you do the parametric polymorphism. Complete type safety, no casting, and only a single virtual function in the parent class:
class ParentType {
public:
    virtual void do_work(...[params]...) = 0;
    virtual ~ParentType() {}
};

class ChildType1 : public ParentType {
private:
    // all my private variables and functions
public:
    virtual void do_work(...[params]...) {
        // call private functions and use private variables from ChildType1
    }
};

class ChildType2 : public ParentType {
private:
    // completely different private variables and functions
public:
    virtual void do_work(...[params]...) {   // was "do-work", an invalid identifier
        // call private functions and use private variables from ChildType2
    }
};
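Consumption then reduces to a single queue of base-class pointers; a brief sketch (locking elided, and the queue name is invented):

#include <memory>
#include <queue>

// Guarded in real code by the single mutex/condition variable pair discussed above.
std::queue<std::unique_ptr<ParentType>> workQueue;

void consume_one()
{
    std::unique_ptr<ParentType> item = std::move(workQueue.front());
    workQueue.pop();
    item->do_work(/* params */);   // virtual dispatch picks ChildType1 or ChildType2
}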

how to pass data to running thread

When using pthread, I can pass data at thread creation time.
What is the proper way of passing new data to an already running thread?
I'm considering making a global variable and make my thread read from that.
Thanks
That will certainly work. Basically, threads are just lightweight processes that share the same memory space. Global variables, being in that memory space, are available to every thread.
The trick is not with the readers so much as the writers. If you have a simple chunk of global memory, like an int, then assigning to that int will probably be safe. But consider something a little more complicated, like a struct. Just to be definite, let's say we have
struct S { int a; float b; } s1, s2;
Now s1,s2 are variables of type struct S. We can initialize them
s1 = { 42, 3.14f };
and we can assign them
s2 = s1;
But when we assign them the processor isn't guaranteed to complete the assignment to the whole struct in one step -- we say it's not atomic. So let's now imagine two threads:
thread 1:
while (true){
    printf("{%d,%f}\n", s2.a, s2.b );
    sleep(1);
}

thread 2:
while(true){
    sleep(1);
    s2 = s1;
    s1.a += 1;
    s1.b += 3.14f;
}
We can see that we'd expect s2 to have the values {42, 3.14}, {43, 6.28}, {44, 9.42} ....
But what we see printed might be anything like
{42,3.14}
{43,3.14}
{43,6.28}
or
{43,3.14}
{44,6.28}
and so on. The problem is that thread 1 may get control and "look at" s2 at any time during that assignment.
The moral is that while global memory is a perfectly workable way to do it, you need to take into account the possibility that your threads will cross over one another. There are several solutions to this, with the basic one being to use semaphores. A semaphore has two operations, confusingly named from Dutch as P and V.
P simply waits until a variable is 0 and then goes on, adding 1 to the variable; V subtracts 1 from the variable. The only thing special is that they do this atomically -- they can't be interrupted.
Now, you code as
thread 1:
while (true){
    P();
    printf("{%d,%f}\n", s2.a, s2.b );
    V();
    sleep(1);
}

thread 2:
while(true){
    sleep(1);
    P();
    s2 = s1;
    V();
    s1.a += 1;
    s1.b += 3.14f;
}
and you're guaranteed that you'll never have thread 2 half-completing an assignment while thread 1 is trying to print.
(Pthreads has semaphores, by the way.)
I have been using the message-passing, producer-consumer queue-based, comms mechanism, as suggested by asveikau, for decades without any problems specifically related to multiThreading. There are some advantages:
1) The 'threadCommsClass' instances passed on the queue can often contain everything required for the thread to do its work - member/s for input data, member/s for output data, methods for the thread to call to do the work, somewhere to put any error/exception messages and a 'returnToSender(this)' event to call so returning everything to the requester by some thread-safe means that the worker thread does not need to know about. The worker thread then runs asynchronously on one set of fully encapsulated data that requires no locking. 'returnToSender(this)' might queue the object onto another P-C queue, it might PostMessage it to a GUI thread, it might release the object back to a pool or just dispose() it. Whatever it does, the worker thread does not need to know about it.
2) There is no need for the requesting thread to know anything about which thread did the work - all the requestor needs is a queue to push on. In an extreme case, the worker thread on the other end of the queue might serialize the data and communicate it to another machine over a network, only calling returnToSender(this) when a network reply is received - the requestor does not need to know this detail - only that the work has been done.
3) It is usually possible to arrange for the 'threadCommsClass' instances and the queues to outlive both the requester thread and the worker thread. This greatly eases those problems when the requester or worker are terminated and dispose()'d before the other - since they share no data directly, there can be no AV/whatever. This also blows away all those 'I can't stop my work thread because it's stuck on a blocking API' issues - why bother stopping it if it can be just orphaned and left to die with no possibility of writing to something that is freed?
4) A threadpool reduces to a one-line for loop that creates several work threads and passes them the same input queue.
5) Locking is restricted to the queues. The more mutexes, condVars, critical-sections and other synchro locks there are in an app, the more difficult it is to control it all and the greater the chance of an intermittent deadlock that is a nightmare to debug. With queued messages, (ideally), only the queue class has locks. The queue class must work 100% with multiple producers/consumers, but that's one class, not an app full of uncoordinated locking, (yech!).
6) A threadCommsClass can be raised anytime, anywhere, in any thread and pushed onto a queue. It's not even necessary for the requester code to do it directly, eg. a call to a logger class method, 'myLogger.logString("Operation completed successfully");' could copy the string into a comms object, queue it up to the thread that performs the log write and return 'immediately'. It is then up to the logger class thread to handle the log data when it dequeues it - it may write it to a log file, it may find after a minute that the log file is unreachable because of a network problem. It may decide that the log file is too big, archive it and start another one. It may write the string to disk and then PostMessage the threadCommsClass instance on to a GUI thread for display in a terminal window, whatever. It doesn't matter to the log requesting thread, which just carries on, as do any other threads that have called for logging, without significant impact on performance.
7) If you do need to kill off a thread waiting on a queue, rather than waiting for the OS to kill it on app close, just queue it a message telling it to terminate.
There are surely disadvantages:
1) Shoving data directly into thread members, signaling it to run and waiting for it to finish is easier to understand and will be faster, assuming that the thread does not have to be created each time.
2) Truly asynchronous operation, where the thread is queued some work and, sometime later, returns it by calling some event handler that has to communicate the results back, is more difficult to handle for developers used to single-threaded code and often requires state-machine type design where context data must be sent in the threadCommsClass so that the correct actions can be taken when the results come back. If there is the occasional case where the requestor just has to wait, it can send an event in the threadCommsClass that gets signaled by the returnToSender method, but this is obviously more complex than simply waiting on some thread handle for completion.
Whatever design is used, forget the simple global variables as other posters have said. There is a case for some global types in thread comms - one I use very often is a thread-safe pool of threadCommsClass instances, (this is just a queue that gets pre-filled with objects). Any thread that wishes to communicate has to get a threadCommsClass instance from the pool, load it up and queue it off. When the comms is done, the last thread to use it releases it back to the pool. This approach prevents runaway new(), and allows me to easily monitor the pool level during testing without any complex memory-managers, (I usually dump the pool level to a status bar every second with a timer). Leaking objects, (level goes down), and double-released objects, (level goes up), are easily detected and so get fixed.
MultiThreading can be safe and deliver scaleable, high-performance apps that are almost a pleasure to maintain/enhance, (almost:), but you have to lay off the simple globals - treat them like Tequila - quick and easy high for now but you just know they'll blow your head off tomorrow.
Good luck!
Martin
Global variables are bad to begin with, and even worse with multi-threaded programming. Instead, the creator of the thread should allocate some sort of context object that's passed to pthread_create, which contains whatever buffers, locks, condition variables, queues, etc. are needed for passing information to and from the thread.
You will need to build this yourself. The most typical approach requires some cooperation from the other thread as it would be a bit of a weird interface to "interrupt" a running thread with some data and code to execute on it... That would also have some of the same trickiness as something like POSIX signals or IRQs, both of which it's easy to shoot yourself in the foot while processing, if you haven't carefully thought it through... (Simple example: You can't call malloc inside a signal handler because you might be interrupted in the middle of malloc, so you might crash while accessing malloc's internal data structures which are only partially updated.)
The typical approach is to have your thread creation routine basically be an event loop. You can build a queue structure and pass that as the argument to the thread creation routine. Then other threads can enqueue things and the thread's event loop will dequeue it and process the data. Note this is cleaner than a global variable (or global queue) because it can scale to have multiple of these queues.
You will need some synchronization on that queue data structure. Entire books could be written about how to implement your queue structure's synchronization, but the most simple thing would have a lock and a semaphore. When modifying the queue, threads take a lock. When waiting for something to be dequeued, consumer threads would wait on a semaphore which is incremented by enqueuers. It's also a good idea to implement some mechanism to shut down the consumer thread.
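Putting those pieces together, a sketch of a context object handed to the thread at creation (std::thread is used for brevity; the same shape works with pthread_create taking a pointer to the context as its argument; all names are invented):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

struct ThreadContext {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::string> inbox;
    bool shutdown = false;
};

void worker(ThreadContext& ctx)
{
    for (;;) {
        std::unique_lock<std::mutex> l(ctx.m);
        ctx.cv.wait(l, [&]{ return !ctx.inbox.empty() || ctx.shutdown; });
        if (ctx.inbox.empty())
            return;                              // shutdown requested and drained
        std::string msg = std::move(ctx.inbox.front());
        ctx.inbox.pop();
        l.unlock();
        // process msg outside the lock
    }
}

void post(ThreadContext& ctx, std::string msg)
{
    { std::lock_guard<std::mutex> l(ctx.m); ctx.inbox.push(std::move(msg)); }
    ctx.cv.notify_one();
}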

Lightest synchronization primitive for worker thread queue

I am about to implement a worker thread with work item queuing, and while I was thinking about the problem, I wanted to know if I'm doing the best thing.
The thread in question will have to have some thread local data (preinitialized at construction) and will loop on work items until some condition will be met.
pseudocode:
volatile bool run = true;

int WorkerThread(param)
{
    localclassinstance c1 = new c1();
    [other initialization]

    while(true) {
        [LOCK]
        [unqueue work item]
        [UNLOCK]

        if([hasWorkItem]) {
            [process data]
            [PostMessage with pointer to data]
        }

        [Sleep]

        if(!run)
            break;
    }

    [uninitialize]
    return 0;
}
I guess I will do the locking via critical section, as the queue will be std::vector or std::queue, but maybe there is a better way.
The part with Sleep doesn't look too great, as there will be a lot of extra Sleep time with big Sleep values, or lots of extra locking when the Sleep value is small, and that's definitely unnecessary.
But I can't think of a WaitForSingleObject-friendly primitive I could use instead of a critical section, as there might be two threads queuing work items at the same time. So an Event, which seems to be the best candidate, can lose the second work item if the Event was already set, and it doesn't guarantee mutual exclusion.
Maybe there is even a better approach with InterlockedExchange kind of functions that leads to even less serialization.
P.S.: I might need to preprocess the whole queue and drop the obsolete work items during the unqueuing stage.
There are a multitude of ways to do this.
One option is to use a semaphore for the waiting. The semaphore is signalled every time a value is pushed on the queue, so the worker thread will only block if there are no items in the queue. This will still require separate synchronization on the queue itself.
A second option is to use a manual-reset event which is set when there are items in the queue and cleared when the queue is empty. Again, you will need to do separate synchronization on the queue.
A third option is to have an invisible message-only window created on the thread, and use a special WM_USER or WM_APP message to post items to the queue, attaching the item to the message via a pointer.
Another option is to use condition variables. The native Windows condition variables only work if you're targetting Windows Vista or Windows 7, but condition variables are also available for Windows XP with Boost or an implementation of the C++0x thread library. An example queue using boost condition variables is available on my blog: http://www.justsoftwaresolutions.co.uk/threading/implementing-a-thread-safe-queue-using-condition-variables.html
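For illustration, the first option sketched with C++20's std::counting_semaphore (which postdates the question; Win32 semaphores have the same shape), still with separate synchronization on the queue itself:

#include <mutex>
#include <queue>
#include <semaphore>

std::counting_semaphore<> itemsAvailable(0);
std::mutex queueMutex;
std::queue<int> workQueue;   // int stands in for the real work item type

void producer_push(int item)
{
    { std::lock_guard<std::mutex> l(queueMutex); workQueue.push(item); }
    itemsAvailable.release();   // one release per pushed item
}

int worker_pop()
{
    itemsAvailable.acquire();   // blocks only while the queue is empty
    std::lock_guard<std::mutex> l(queueMutex);
    int item = workQueue.front();
    workQueue.pop();
    return item;
}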
It is possible to share a resource between threads without using blocking locks at all, if your scenario meets certain requirements.
You need an atomic pointer exchange primitive, such as Win32's InterlockedExchange. Most processor architectures provide some sort of atomic swap, and it's usually much less expensive than acquiring a formal lock.
You can store your queue of work items in a pointer variable that is accessible to all the threads that will be interested in it. (global var, or field of an object that all the threads have access to)
This scenario assumes that the threads involved always have something to do, and only occasionally "glance" at the shared resource. If you want a design where threads block waiting for input, use a traditional blocking event object.
Before anything begins, create your queue or work item list object and assign it to the shared pointer variable.
Now, when producers want to push something onto the queue, they "acquire" exclusive access to the queue object by swapping a null into the shared pointer variable using InterlockedExchange. If the result of the swap returns a null, then somebody else is currently modifying the queue object. Sleep(0) to release the rest of your thread's time slice, then loop to retry the swap until it returns non-null. Even if you end up looping a few times, this is many, many times faster than making a kernel call to acquire a mutex object. Kernel calls require hundreds of clock cycles to transition into kernel mode.
When you successfully obtain the pointer, make your modifications to the queue, then swap the queue pointer back into the shared pointer.
When consuming items from the queue, you do the same thing: swap a null into the shared pointer and loop until you get a non-null result, operate on the object in the local var, then swap it back into the shared pointer var.
This technique is a combination of atomic swap and brief spin loops. It works well in scenarios where the threads involved are not blocked and collisions are rare. Most of the time the swap will give you exclusive access to the shared object on the first try, and as long as the length of time the queue object is held exclusively by any thread is very short then no thread should have to loop more than a few times before the queue object becomes available again.
If you expect a lot of contention between threads in your scenario, or you want a design where threads spend most of their time blocked waiting for work to arrive, you may be better served by a formal mutex synchronization object.
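The same pattern can be sketched with standard atomics in place of InterlockedExchange (int stands in for the real work item type; collisions are assumed rare):

#include <atomic>
#include <queue>
#include <thread>

std::atomic<std::queue<int>*> sharedQueue{ new std::queue<int> };

// Swap the queue out; spin (yielding) until we get it.
std::queue<int>* acquire_queue()
{
    for (;;) {
        std::queue<int>* q = sharedQueue.exchange(nullptr);
        if (q) return q;             // we now own the queue exclusively
        std::this_thread::yield();   // someone else has it; try again
    }
}

void release_queue(std::queue<int>* q)
{
    sharedQueue.store(q);            // publish it back for other threads
}

void push_item(int item)
{
    std::queue<int>* q = acquire_queue();
    q->push(item);
    release_queue(q);
}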
The fastest locking primitive is usually a spin-lock or spin-sleep-lock. CRITICAL_SECTION is just such a (user-space) spin-sleep-lock.
(Well, aside from not using locking primitives at all of course. But that means using lock-free data-structures, and those are really really hard to get right.)
As for avoiding the Sleep: have a look at condition-variables. They're designed to be used together with a "mutex", and I think they're much easier to use correctly than Windows' EVENTs.
Boost.Thread has a nice portable implementation of both, fast user-space spin-sleep-locks and condition variables:
http://www.boost.org/doc/libs/1_44_0/doc/html/thread/synchronization.html#thread.synchronization.condvar_ref
A work-queue using Boost.Thread could look something like this:
template <class T>
class Queue : private boost::noncopyable
{
public:
    // note: the original snippet never initialized m_maxSize; a constructor is needed
    explicit Queue(size_t maxSize) : m_maxSize(maxSize) {}

    void Enqueue(T const& t)
    {
        unique_lock lock(m_mutex);

        // wait until the queue is not full
        while (m_backingStore.size() >= m_maxSize)
            m_queueNotFullCondition.wait(lock); // releases the lock temporarily

        m_backingStore.push_back(t);
        m_queueNotEmptyCondition.notify_all(); // notify waiters that the queue is not empty
    }

    T DequeueOrBlock()
    {
        unique_lock lock(m_mutex);

        // wait until the queue is not empty
        while (m_backingStore.empty())
            m_queueNotEmptyCondition.wait(lock); // releases the lock temporarily

        T t = m_backingStore.front();
        m_backingStore.pop_front();

        m_queueNotFullCondition.notify_all(); // notify waiters that the queue is not full
        return t;
    }

private:
    typedef boost::recursive_mutex mutex;
    typedef boost::unique_lock<boost::recursive_mutex> unique_lock;

    size_t const m_maxSize;
    mutex mutable m_mutex;
    boost::condition_variable_any m_queueNotEmptyCondition;
    boost::condition_variable_any m_queueNotFullCondition;
    std::deque<T> m_backingStore;
};
There are various ways to do this
For one, you could create an event called 'run' and use that to detect when the thread should terminate; the main thread signals it. Instead of Sleep you would then use WaitForSingleObject with a timeout; that way you will quit directly instead of waiting out the Sleep ms.
Another way is to accept messages in your loop and then invent a user defined message that you post to the thread
EDIT: depending on the situation it may also be wise to have yet another thread that monitors this thread to check if it is dead or not; this can be done via the above-mentioned message queue, so that replying to a certain message within x ms would mean that the thread hasn't locked up.
I'd restructure a bit:
WorkItem GetWorkItem()
{
    while(true)
    {
        WaitForSingleObject(queue.Ready);
        {
            ScopeLock lock(queue.Lock);
            if(!queue.IsEmpty())
            {
                return queue.GetItem();
            }
        }
    }
}

int WorkerThread(param)
{
    bool done = false;
    do
    {
        WorkItem work = GetWorkItem();
        if( work.IsQuitMessage() )
        {
            done = true;
        }
        else
        {
            work.Process();
        }
    } while(!done);

    return 0;
}
Points of interest:
ScopeLock is a RAII class to make critical section usage safer.
Block on event until workitem is (possibly) ready - then lock while trying to dequeue it.
don't use a global "IsDone" flag, enqueue special quitmessage WorkItems.
You can have a look at another approach here that uses C++0x atomic operations
http://www.drdobbs.com/high-performance-computing/210604448
Use a semaphore instead of an event.
Keep the signaling and synchronizing separate. Something along these lines...
// in main thread
HANDLE events[2];
events[0] = CreateEvent(...); // for shutdown
events[1] = CreateEvent(...); // for work to do
// start thread and pass the events

// in worker thread
DWORD ret;
while (true)
{
    ret = WaitForMultipleObjects(2, events, FALSE, <timeout val or INFINITE>);
    if shutdown
        return
    else if do-work
        enter crit sec
        unqueue work
        leave crit sec
        etc.
    else if timeout
        do something else that has to be done
}
Given that this question is tagged windows, I'll answer thus:
Don't create 1 worker thread. Your worker thread jobs are presumably independent, so you can process multiple jobs at once? If so:
In your main thread call CreateIoCompletionPort to create an I/O completion port object.
Create a pool of worker threads. The number you need to create depends on how many jobs you might want to service in parallel. Some multiple of the number of CPU cores is a good start.
Each time a job comes in call PostQueuedCompletionStatus() passing a pointer to the job struct as the lpOverlapped struct.
Each worker thread calls GetQueuedCompletionStatus() - retrieves the work item from the lpOverlapped pointer and does the job before returning to GetQueuedCompletionStatus.
This looks heavy, but io completion ports are implemented in kernel mode and represent a queue that can be drained by any of the worker threads associated with the queue (i.e. waiting on a call to GetQueuedCompletionStatus). The io completion port knows how many of the threads that are processing an item are actually using a CPU vs blocked on an IO call - and will release more worker threads from the pool to ensure that the concurrency count is met.
So, it's not lightweight, but it is very very efficient... io completion ports can be associated with pipe and socket handles, for example, and can dequeue the results of asynchronous operations on those handles. io completion port designs can scale to handling tens of thousands of socket connections on a single server - but on the desktop side of the world they make a very convenient way of scaling processing of jobs over the 2 or 4 cores now common in desktop PCs.
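A skeleton of that design using the Win32 calls named above (error handling elided; the Job struct is invented for this sketch):

#include <windows.h>

struct Job { /* work item fields (invented for this sketch) */ };

HANDLE port;   // the completion port, created once in the main thread

DWORD WINAPI worker(LPVOID)
{
    DWORD bytes;
    ULONG_PTR key;
    LPOVERLAPPED ov;
    while (GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE))
    {
        Job* job = reinterpret_cast<Job*>(ov);   // recover the smuggled pointer
        if (!job)
            break;                               // a null post can signal shutdown
        // ...process *job*, then loop back for the next one...
    }
    return 0;
}

void init_pool(int threads)
{
    port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    for (int i = 0; i < threads; ++i)
        CreateThread(NULL, 0, worker, NULL, 0, NULL);
}

void submit(Job* job)
{
    // Smuggle the job pointer through the lpOverlapped parameter.
    PostQueuedCompletionStatus(port, 0, 0, reinterpret_cast<LPOVERLAPPED>(job));
}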

Wanted: Elegant solution to race condition

I have the following code:
class TimeOutException
{};

template <typename T>
class MultiThreadedBuffer
{
public:
    MultiThreadedBuffer()
    {
        InitializeCriticalSection(&m_csBuffer);
        m_evtDataAvail = CreateEvent(NULL, TRUE, FALSE, NULL);
    }

    ~MultiThreadedBuffer()
    {
        CloseHandle(m_evtDataAvail);
        DeleteCriticalSection(&m_csBuffer);
    }

    void LockBuffer()
    {
        EnterCriticalSection(&m_csBuffer);
    }

    void UnlockBuffer()
    {
        LeaveCriticalSection(&m_csBuffer);
    }

    void Add(T val)
    {
        LockBuffer();
        m_buffer.push_back(val);
        SetEvent(m_evtDataAvail);
        UnlockBuffer();
    }

    T Get(DWORD timeout)
    {
        T val;
        if (WaitForSingleObject(m_evtDataAvail, timeout) == WAIT_OBJECT_0) {
            LockBuffer();
            if (!m_buffer.empty()) {
                val = m_buffer.front();
                m_buffer.pop_front();
            }
            if (m_buffer.empty()) {
                ResetEvent(m_evtDataAvail);
            }
            UnlockBuffer();
        } else {
            throw TimeOutException();
        }
        return val;
    }

    bool IsDataAvail()
    {
        return (WaitForSingleObject(m_evtDataAvail, 0) == WAIT_OBJECT_0);
    }

    std::list<T> m_buffer;
    CRITICAL_SECTION m_csBuffer;
    HANDLE m_evtDataAvail;
};
Unit testing shows that this code works fine when used on a single thread as long as T's default constructor and copy/assignment operators don't throw. Since I'm writing T, that is acceptable.
My problem is the Get method. If there is no data available (i.e. m_evtDataAvail is not set), then a couple of threads can block on the WaitForSingleObject call. When new data becomes available, they all fall through to the Lock() call. Only one will pass and can get the data out and move on. After the Unlock() another thread can move on through and will find that there is no data. Currently it will return the default object.
What I want to happen is for that second thread (and others) to go back to the WaitForSingleObject call. I could add an else block that unlocked and did a goto but that just feels evil.
That solution also adds the possibility for an endless loop since each trip back would restart the timeout. I could add some code to check the clock on entry and adjust the timeout on each trip back but then this simple Get method starts to get very complicated.
Any ideas on how to solve these problems while maintaining testability and simplicity?
Oh, for anyone wondering, the IsDataAvail function only exists for testing. It won't be used in production code. The Add and Get are the only methods that will be used in a non-testing environment.
You need to create an auto-reset event instead of a manual-reset event. This guarantees that if multiple threads are waiting on an event and the event is set, only one thread will be released; all other threads will remain in the waiting state. You can create an auto-reset event by passing FALSE as the second parameter of the CreateEvent API. Also, note that this code is not exception safe, i.e. after locking the buffer, if some statement throws an exception your critical section will not be unlocked. Use the RAII principle to ensure that your critical section gets unlocked even in the case of exceptions.
You could use a Semaphore object instead of a generic Event object. The semaphore count should be initialized to 0 and incremented by 1 with ReleaseSemaphore each time Add is called. That way the WaitForSingleObject in Get will never release more threads to read from the buffer than there are values in the buffer.
You will always have to code for the case where the event is signaled but there is no data, even WITH auto-reset events. There is a race condition from the moment WaitForSingleObject wakes until LockBuffer is called, and in that interval another thread can pop the data from the buffer. Your code must place WaitForSingleObject in a loop. Decrease the timeout by the time already spent in each loop iteration...
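Along those lines, a sketch of Get() reworked to loop and shrink the timeout on each retry, using GetTickCount64 (Vista and later) as the clock; a finite timeout is assumed, the original members are reused, and RAII locking as suggested above would harden it further:

T Get(DWORD timeout)
{
    ULONGLONG deadline = GetTickCount64() + timeout;
    for (;;) {
        ULONGLONG now = GetTickCount64();
        if (now >= deadline)
            throw TimeOutException();
        if (WaitForSingleObject(m_evtDataAvail,
                                static_cast<DWORD>(deadline - now)) != WAIT_OBJECT_0)
            throw TimeOutException();
        LockBuffer();
        if (!m_buffer.empty()) {
            T val = m_buffer.front();
            m_buffer.pop_front();
            if (m_buffer.empty())
                ResetEvent(m_evtDataAvail);
            UnlockBuffer();
            return val;
        }
        UnlockBuffer();   // another consumer won the race; wait again
    }
}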
As an alternative, may I interest you in more scalable and performant alternatives? Interlocked singly-linked lists, the OS thread pool's QueueUserWorkItem and idempotent processing. Add pushes an entry into the list and submits a work item. The work item pops an entry and, if not NULL, processes it. You can go fancy and have extra logic for the processor to loop and keep a state marking its 'active' presence so that Add does not queue unnecessary work items, but that is not strictly required. For even higher scale and multi-core/multi-CPU load spread I recommend using queued completion ports. The details are described in Rick Vicik's articles; I have a blog entry that links all 3 at once: High Performance Windows programs.