Embedded System State Pattern: Storing information related to asynchronous events - c++

I have a embedded system software, where I go through bunch of hardware initialization steps and then go to either Mode1 or Mode2 depending on history of events that occurred. Even within a certain mode, I do certain things depending on the history of events
e.g.
if my display is off then in Mode 1 I take a flow which is different than what I would taken if the display was on. And the display notification arrives asynchronously, I don't explicitly query for that information.
There are few other similar events that arrive asynchronously that can change the course of action I would take further down the flow.
I am trying to understand how to store the information related to these events that happened in the past. I am inclined to store them as flags but then it defeats the purpose of using state pattern (and it is also error prone).
Another option I have is store these information in the state itself e.g.
Mode1_DisplayOff_Atrribx, Mode1_DisplayOff_Atrribx, Mode2_DisplayOff_Atrribx, Mode2_DisplayOff_Atrriby. But I fear this will make the state machine complex.
What should be the right approach here?
(Question is not necessarily related to embedded systems)

Some general design advise for this case, or for any case with complex state handling:
Overall design advise: keep it as simple as possible. Strive for simplicity, not complexity.
Create a state machine corresponding to the number of combinations of "modes", "events", "flags" etc that you have. This can be as simple or as advanced as needed - sometimes you might need to implement sub-states (for example in state "error", there are sub states "error display" and "error adc" etc).
Avoid a jungle of flags: it is bad enough that you have to collect flags from numerous sources. If you also combine this with local decision makers, you will have to write complex code all over your program, where the local decision makers cause state changes. It will very soon turn impossible to keep track of program flow and code coverage.
This also tends to lead to tight coupling between a lot of modules that could have been independent of each other, which is always a very bad thing.
Implement a standardized error handler for all states, with a standard error data type (error number, error origin etc).
Keep the asynchronous information gathering separate from the state machine. That is, if the asynchronous events happen independently of which state the program is currently executing. If they don't, you'll have to integrate them in the state machine.
Centralize the decision making of what to do next, at one single place in the program. Preferably combined with the error handler.
Your main loop would then look something like:
for(;;)
{
state_result = state_machine[current_state]();
event_result = gather_event(); // might need several of these
current_state = evaluate_results (state_result, event_result);
}
Where evaluate_results is the only place in your program where state changes are allowed to occur. This function is only concerned with what state to execute next, it does not perform any actual work.

Related

Event Sourcing/CQRS doubts about aggregates, atomicity, concurrency and eventual consistency

I'm studying event sourcing and command/query segregation and I have a few doubts that I hope someone with more experience will easily answer:
A) should a command handler work with more than one aggregate? (a.k.a. should they coordinate things between several aggregates?)
B) If my command handler generates more than one event to store, how do you guys push all those events atomically to the event store? (how can I garantee no other command handler will "interleave" events in between?)
C) In many articles I read people suggest using optimistic locking to write the new events generated, but in my use case I will have around 100 requests / second. This makes me think that a lot of requests will just fail at huge rates (a lot of ConcurrencyExceptions), how you guys deal with this?
D) How to deal with the fact that the command handler can crash after storing the events in the event store but before publishing them to the event bus? (how to eventually push those "confirmed" events back to the event bus?)
E) How you guys deal with the eventual consistency in the projections? you just live with it? or in some cases people lock things there too? (waiting for an update for example)
I made a sequence diagram to better ilustrate all those questions
(and sorry for the bad english)
If my command handler generates more than one event to store, how do you guys push all those events atomically to the event store?
Most reasonable event store implementations will allow you to batch multiple events into the same transaction.
In many articles I read people suggest using optimistic locking to write the new events generated, but in my use case I will have around 100 requests / second.
If you have lots of parallel threads trying to maintain a complex invariant, something has gone badly wrong.
For "events" that aren't expected to establish or maintain any invariant, then you are just writing things to the end of a stream. In other words, you are probably not trying to write an event into a specific position in the stream. So you can probably use batching to reduce the number of conflicting writes, and a simple retry mechanism. In effect, you are using the same sort of "fan-in" patterns that appear when you have concurrent writers inserting into a queue.
For the cases where you are establishing/maintaining an invariant, you don't normally have many concurrent writers. Instead, specific writers have authority to write events (think "sharding"); the concurrency controls there are primarily to avoid making a mess in abnormal conditions.
How to deal with the fact that the command handler can crash after storing the events in the event store but before publishing them to the event bus?
Use pull, rather than push, as the primary subscription mechanism. Make sure that subscribers can handle duplicate messages safely (aka "idempotent"). Don't use a message subscription that can re-order events when you need events strictly ordered.
How you guys deal with the eventual consistency in the projections? you just live with it?
Pretty much. Views and reports have metadata information in them to let you know at what fixed point in "time" the report was accurate.
Unless you lock out all writers while a report is being consumed, there's a potential for any data being out of date, regardless of whether you are using events vs some other data model, regardless of whether you are using a single data model or several.
It's all part of the tradeoff; we accept that there will be a larger window between report time and current time in exchange for lower response latency, an "immutable" event history, etc.
should a command handler work with more than one aggregate?
Probably not - which isn't the same thing as always never.
Usual framing goes something like this: aggregate isn't a domain modeling pattern, like entity. It's a lifecycle pattern, used to make sure that all of the changes we make at one time are consistent.
In the case where you find that you want a command handler to modify multiple domain entities at the same time, and those entities belong to different aggregates, then have you really chosen the correct aggregate boundaries?
What you can do sometimes is have a single command handler that manages multiple transactions, updating a different aggregate in each. But it might be easier, in the long run, to have two different command handlers that each receive a copy of the command and decide what to do, independently.

Is it possible to cause multi processes hibernate (core dump?)?

I have a software (c++) that runs few processes (each process is a major system itself).
The processes have communication with each other via xml-rpc or boost asio
I want to be able to freeze or stop all processes at a given moment and be able to raise the system (all processes) later to the same state as before hibernating.
How can I do that in c++?
Would it be feasible due to the fact that the processes communicates with each other?
The big picture is that you need to get the system to a stable consistent state, then persist that state in some re-creatable form.
You can in principle write such code, the degree of difficulty depends on your application. You will need to figure out things such as:
How the processes agree that they are in a consistent state. You may need to define some new "Get ready to hibernate" and "I'm ready" messages.
For each process you need to figure out how to persist and recover it's state. Depending upon the complexity of any live data structures that may be quite tricky. On the other hand, if your processes are stateless then this could be really easy.
You'll need to devise a scheme for managing the sets of hibernated data, how you determine a consistent set across all the processes.
I see this as significant coding effort, the degree of difficulty will depend on the complexity of your application and the quality of its implementation. In a well structured application such major "replumbing" exercises often go surprisingly simply.
Unless you're an OS - no, it won't be possible.
What you need to do instead is to make sure that each process can do it for itself (i.e.: write a functionality that allows saving and restoring the states for each of the processes), and also to accommodate for inconsistencies in the communication (for example - to ensure ACK on the messages, and resend if saved state without receiving ACK).
It's feasible if done right, but it's easier said than done, of course, and assumes you can actually change the processes.
Well,
the other answers are fine. There is another rather "exotic" way which may solve this quickly, but it may be overkill or not suitable. But who knows ? So just in case...
I suggest to run your program into a virtual machine (I mean for example a linux with vmware) and pause/wake up this virtual machine at will.
If you are using an inter-process communication method which is not disrupted by this kind of operation, it may work and save you a lot of time.
Good luck.

Controlled application shut-down strategy

Our (Windows native C++) app is composed of threaded objects and managers. It is pretty well written, with a design that sees Manager objects controlling the lifecycle of their minions. Various objects dispatch and receive events; some events come from Windows, some are home-grown.
In general, we have to be very aware of thread interoperability so we use hand-rolled synchronization techniques using Win32 critical sections, semaphores and the like. However, occasionally we suffer thread deadlock during shut-down due to things like event handler re-entrancy.
Now I wonder if there is a decent app shut-down strategy we could implement to make this easier to develop for - something like every object registering for a shutdown event from a central controller and changing its execution behaviour accordingly? Is this too naive or brittle?
I would prefer strategies that don't stipulate rewriting the entire app to use Microsoft's Parallel Patterns Library or similar. ;-)
Thanks.
EDIT:
I guess I am asking for an approach to controlling object life cycles in a complex app where many threads and events are firing all the time. Giovanni's suggestion is the obvious one (hand-roll our own), but I am convinced there must be various off-the-shelf strategies or frameworks, for cleanly shutting down active objects in the correct order. For example, if you want to base your C++ app on an IoC paradigm you might use PocoCapsule instead of trying to develop your own container. Is there something similar for controlling object lifecycles in an app?
This seems like a special case of the more general question, "how do I avoid deadlocks in my multithreaded application?"
And the answer to that is, as always: make sure that any time your threads have to acquire more than one lock at a time, that they all acquire the locks in the same order, and make sure all threads release their locks in a finite amount of time. This rule applies just as much at shutdown as at any other time. Nothing less is good enough; nothing more is necessary. (See here for a relevant discussion)
As for how to best do this... the best way (if possible) is to simplify your program as much as you can, and avoid holding more than one lock at a time if you can possibly help it.
If you absolutely must hold more than one lock at a time, you must verify your program to be sure that every thread that holds multiple locks locks them in the same order. Programs like helgrind or Intel thread checker can help with this, but it often comes down to simply eyeballing the code until you've proved to yourself that it satisfies this constraint. Also, if you are able to reproduce the deadlocks easily, you can examine (using a debugger) the stack trace of each deadlocked thread, which will show where the deadlocked threads are forever-blocked at, and with that information, you can that start to figure out where the lock-ordering inconsistencies are in your code. Yes, it's a major pain, but I don't think there is any good way around it (other than avoiding holding multiple locks at once). :(
One possible general strategy would be to send an "I am shutting down" event to every manager, which would cause the managers to do one of three things (depending on how long running your event-handlers are, and how much latency you want between the user initiating shutdown, and the app actually exiting).
1) Stop accepting new events, and run the handlers for all events received before the "I am shutting down" event. To avoid deadlocks you may need to accept events that are critical to the completion of other event handlers. These could be signaled by a flag in the event or the type of the event (for example). If you have such events then you should also consider restructuring your code so that those actions are not performed through event handlers (as dependent events would be prone to deadlocks in ordinary operation too.)
2) Stop accepting new events, and discard all events that were received after the event that the handler is currently running. Similar comments about dependent events apply in this case too.
3) Interrupt the currently running event (with a function similar to boost::thread::interrupt()), and run no further events. This requires your handler code to be exception safe (which it should already be, if you care about resource leaks), and to enter interruption points at fairly regular intervals, but it leads to the minimum latency.
Of course you could mix these three strategies together, depending on the particular latency and data corruption requirements of each of your managers.
As a general method, use an atomic boolean to indicate "i am shutting down", then every thread checks this boolean before acquiring each lock, handling each event etc. Can't give a more detailed answer unless you give us a more detailed question.

Testing concurrent data structure

What are some methods for testing concurrent data structures to make sure the data structs behave correctly when accessed from multiple threads ?
All of the other answers have focused on actually testing the code by putting it through its paces and actually running it in one form or another or politely saying "don't do it yourself, use an existing library".
This is great and all, but IMO, the most important (practical tests are important too) test is to look at the code line by line and for every line of code ask "what happens if I get interrupted by another thread here?" Imagine another thread, running just about any of the other lines/functions during this interruption. Do things still stay consistent? When competing for resources, does the other thread[s] block or spin?
This is what we did in school when learning about concurrency and it is a surprisingly effective approach. Bottom line, I feel that taking the time to prove to yourself that things are consistent and work as expected in all states is the first technique you should use when dealing with this stuff.
Concurrent systems are probabilistic and errors are often difficult to replicate. Therefore you need to run various input/output cases, each tested over time (hours, days, etc) in order to detect possible errors.
Tests for concurrent data structure involves examining the container's state before and after expected events such as insert and delete.
Use a pre-existing, pre-tested library that meets your needs if possible.
Make sure that the code has appropriate self-consistency checks (preferably fast sanity checks), and run your code on as many different types of hardware as possible to help narrow down interesting timing problems.
Have multiple people peer review the code, preferably without a pre-explanation of how it's supposed to work. That way they have to grok the code which should help catch more bugs.
Set up a bunch of threads that do nothing but random operations on the data structures and check for consistency at some rate.
Start with the assumption that your calls to access/modify data are not thread safe and use locks to ensure only a single thread can access/modify any part of the data at a time. Only after you can prove to yourself that a specific type of access is safe outside of the lock by multiple threads at once should you move that code outside of the lock.
Assume worst case scenarios, e.g. that your code will stop right in the middle of some pointer manipulation or another critical point, and that another thread will encounter that data in mid-transition. If that would have a bad result, leave it within the lock.
I normally test these kinds of things by interjecting sleep() calls at appropriate places in the distributed threads/processes.
For instance, to test a lock, put sleep(2) in all your threads at the point of contention, and spawn two threads roughly 1 second apart. The first one should obtain the lock, and the second should have to wait for it.
Most race conditions can be tested by extending this method, but if your system has too many components it may be difficult or impossible to know every possible condition that needs to be tested.
Run your concurrent threads for one or a few days and look what happens. (Sounds strange, but finding out race conditions is such a complex topic that simply trying it is the best approach).

Multithreaded job queue manager

I need to manage CPU-heavy multitaskable jobs in an interactive application. Just as background, my specific application is an engineering design interface. As a user tweaks different parameters and options to a model, multiple simulations are run in the background and results displayed as they complete, likely even as the user is still editing values. Since the multiple simulations take variable time (some are milliseconds, some take 5 seconds, some take 10 minutes), it's basically a matter of getting feedback displayed as fast as possible, but often aborting jobs that started previously but are now no longer needed because of the user's changes have already invalidated them. Different user changes may invalidate different computations so at any time I may have 10 different simulations running. Somesimulations have multiple parts which have dependencies (simulations A and B can be seperately computed, but I need their results to seed simulation C so I need to wait for both A and B to finish first before starting C.)
I feel pretty confident that the code-level method to handle this kind of application is some kind of multithreaded job queue. This would include features of submitting jobs for execution, setting task priorities, waiting for jobs to finish, specifying dependencies (do this job, but only after job X and job Y have finished), canceling subsets of jobs that fit some criteria, querying what jobs remain, setting worker thread counts and priorities, and so on. And multiplatform support is very useful too.
These are not new ideas or desires in software, but I'm at the early design phase of my application where I need to make a choice about what library to use for managing such tasks. I've written my own crude thread managers in the past in C (I think it's a rite of passage) but I want to use modern tools to base my work on, not my own previous hacks.
The first thought is to run to OpenMP but I'm not sure it's what I want. OpenMP is great for parallelizing at a fine level, automatically unrolling loops and such. While multiplatform, it also invades your code with #pragmas. But mostly it's not designed for managing large tasks.. especially cancelling pending jobs or specifying dependencies. Possible, yes, but it's not elegant.
I noticed that Google Chrome uses such a job manager for even the most trivial tasks. The design goal seems to be to keep the user interaction thread as light and nimble as possible, so anything that can get spawned off asynchronously, should be. From looking at the Chrome source this doesn't seem to be a generic library, but it still is interesting to see how the design uses asynchronous launches to keep interaction fast. This is getting to be similar to what I'm doing.
There are a still other options:
Surge.Act: a Boost-like library for defining jobs. It builds on OpenMP, but does allow chaining of dependencies which is nice. It doesn't seem to feel like it's got a manager that can be queried, jobs cancelled, etc. It's a stale project so it's scary to depend on it.
Job Queue is quite close to what I'm thinking of, but it's a 5 year old article, not a supported library.
Boost.threads does have nice platform independent synchronization but that's not a job manager. POCO has very clean designs for task launching, but again not a full manager for chaining tasks. (Maybe I'm underestimating POCO though).
So while there are options available, I'm not satisfied and I feel the urge to roll my own library again. But I'd rather use something that's already in existence. Even after searching (here on SO and on the net) I haven't found anything that feels right, though I imagine this must be a kind of tool that is often needed, so surely there's some community library or at least common design.
On SO there's been some posts about job queues, but nothing that seems to fit.
My post here is to ask you all what existing tools I've missed, and/or how you've rolled your own such multithreaded job queue.
We had to build our own job queue system to meet requirements similar to yours ( UI thread must always respond within 33ms, jobs can run from 15-15000ms ), because there really was nothing out there that quite met our needs, let alone was performant.
Unfortunately our code is about as proprietary as proprietary gets, but I can give you some of the most salient features:
We start up one thread per core at the beginning of the program. Each pulls work from a global job queue. Jobs consist of a function object and a glob of associated data (really an elaboration on a func_ptr and void *). Thread 0, the fast client loop, isn't allowed to work on jobs, but the rest grab as they can.
The job queue itself ought to be a lockless data structure, such as a lock-free singly linked list (Visual Studio comes with one). Avoid using a mutex; contention for the queue is surprisingly high, and grabbing mutexes is costly.
Pack up all the necessary data for the job into the job object itself -- avoid having pointer from the job back into the main heap, where you'll have to deal with contention between jobs and locks and all that other slow, annoying stuff. For example, all the simulation parameters should go into the job's local data blob. The results structure obviously needs to be something that outlives the job: you can deal with this either by a) hanging onto the job objects even after they've finished running (so you can use their contents from the main thread), or b) allocating a results structure specially for each job and stuffing a pointer into the job's data object. Even though the results themselves won't live in the job, this effectively gives the job exclusive access to its output memory so you needn't muss with locks.
Actually I'm simplifying a bit above, since we need to choreograph exactly which jobs run on which cores, so each core gets its own job queue, but that's probably unnecessary for you.
I rolled my own, based on Boost.threads. I was quite surprised by how much bang I got from writing so little code. If you don't find something pre-made, don't be afraid to roll your own. Between Boost.threads and your experience since writing your own, it might be easier than you remember.
For premade options, don't forget that Chromium is licensed very friendly, so you may be able to roll your own generic library around its code.
Microsoft is working on a set of technologies for the next Version of Visual Studio 2010 called the Concurrency Runtime, the Parallel Pattern Library and the Asynchronous Agents Library which will probably help. The Concurrency Runtime will offer policy based scheduling, i.e. allowing you to manage and compose multiple scheduler instances (similar to thread pools but with affinitization and load balancing between instances), the Parallel Pattern Library will offer task based programming and parallel loops with an STL like programming model. The Agents library offers an actor based programming model and has support for building concurrent data flow pipelines, i.e. managing those dependencies described above. Unfortunately this isn't released yet, so you can read about it on our team blog or watch some of the videos on channel9 there is also a very large CTP that is available for download as well.
If you're looking for a solution today, Intel's Thread Building Blocks and boost's threading library are both good libraries and available now. JustSoftwareSolutions has released an implementation of std::thread which matches the C++0x draft and of course OpenMP is widely available if you're looking at fine-grained loop based parallelism.
The real challenge as other folks have alluded to is to correctly identify and decompose work into tasks suitable for concurrent execution (i.e. no unprotected shared state), understand the dependencies between them and minimize the contention that can occur on bottlenecks (whether the bottleneck is protecting shared state or ensuring the dispatch loop of a work queue is low contention or lock-free)... and to do this without scheduling implementation details leaking into the rest of your code.
-Rick
Would something like threadpool be useful to you? It's based on boost::threads and basically implements a simple thread task queue that passes worker functions off to the pooled threads.
I've been looking for near the same requirements. I'm working on a game with 4x-ish mechanics and scheduling different parts of what gets done almost exploded my brain. I have a complex set of work that needs to get accomplished at different time resolutions, and to a different degree of actual simulation depending on what system/region the player has actively loaded. This means as the player moves from system to system, I need to load a system to the current high resolution simulation, offload the last system to a lower resolution simulation, and do the same for active/inactive regions of systems. The different simulations are big lists of population, political, military, and economic actions based on profiles of each entity. I'm going to try to describe my issue and my approach so far and I hope it's useful at describe an alternative for you or someone else. The rough outline of the structure I'm building will use the following:
cpp-taskflow (A Modern C++ Parallel Task Programming Library) I'm going to make a library of modules that will be used as job construction parts. Each entry will have an API for initializing and destruction as well as pointers for communication. I'm hoping to write it in a way that they will be nest-able using the cpp-taskflow API to set-up all the dependencies at job creation time, but provide a means of live adjustment and having a kill-switch available. Most of what I'm making will be decision trees of state machines, or state machines of behavior trees so the job data structure will be settings and states of time-resolution tagged data pointing to actual stats and object values.
FlatBuffers I'm looking to use this library to build a "job list entry" as well as an "object wrapper" system. Each entry in the job queues will be a flatbuffer object describing the work needed done(settings for the module), as well as containing the data(or shared pointers to the data) for the work that needs done. The object storage flatbuffers will contain the data that represents entity tables. For me, most of the actual data will me arrays that need deciding/working on. I'm also looking to use flatbuffers as a communication/control channel between threads. I'm torn on making a master "router" thread all the others communicate through, or each one containing their own, and having some mechanism of discovery.
SQLite Since only the active regions/systems need higher resolution work done, some of the background job lists the game will create(for thousands of systems and their entities) will be pretty large and long lived. 100's of thousands - millions of jobs(big in my mind), each requiring an unknown amount of time to complete. In my case, I don't care when they get done, as long as they all do(long campains). I plan on each thread getting a table of an in-memory sqlite db as a job queue. Each entry will contain a blob of flatbuffer work, a pointer to a buffer to notify upon completion, a pointer to a control buffer for updates, and other fields decorating the job item(location, data ranges, priority) that will get filled as the job entry makes new jobs, and as the items are consumed into the database. This give me a way I can create relational ties between jobs and simply construct queries if I need to re-work/update jobs, remove them and their dependencies, or update/re-order priorities or dependencies. All this being used in an sqlite db also means that at any time I can dump the whole thing to disk and reload it later, or switch to attaching to and processing it from disk. Additionally, this gives me access to a lot of search and ordering algorithmic work I'd normally need a bunch of different types of containers for. Being able to use SQL queries gives me a lot of options to process the jobs.
The communication queue(as a db) is what I'm torn as to whether I should make access via the corresponding thread(each thread contains it's own messaging db, and the module API has locks/mutex abstracted for access), or have all updates, adds/removes, and communication via some master router thread into one large db. I have no idea which will give me the least headaches as far as mutexing and locks. I got a few days into making a monster spaghetti beast of shared pointers to sbuffer pools and lookup tables, so each thread had it's own buffer in, and separate out buffers. That's when I decided to just offload the giant list keeping to sqlite. Then I thought, why not just feed the flatbuffer objects of everything else into tables.
Having almost everything in a db means from each module, I can write sql statements that represent the view of the data I need to work on as well as pivot on the fly as to how the data is worked on. Having the jobs themselves in a db means I can do the same for them as well. SQLite has multi-threading access, so using it as a Multithreaded job queue manager shouldn't be too much of a stretch.
In summary, Cpp-Taskflow will allow you to setup complicated nested loops with dependency chaining and job-pool multithreading. Out of the box it comes with most of the structure you need. FlatBuffers will allow you to create job declarations and object wrappers easy to feed into stream-buffers as one unit of work and pass them between job threads, and SQLite will allow you to tag and queue the stream-buffer jobs into blob entries in a way that should allow adding, searching, ordering, updating, and removal with minimal work on your end. It also makes saving and reloading a breeze. Snapshots and roll-backs should also be doable, you just have to keep your mind wrapped around the order and resolution of events for the db.
Edit: Take this with a grain of salt though, I found your question because I'm trying to accomplish what Crashworks described. I'm thinking of using affinity to open long living threads and have the master thread run the majority of the Cpp-Taskflow hierarchy work, feeding jobs to the others. I've yet to use the sqlite meothod of job-queue/control communication, that's just my plan so far.
I hope someone finds this helpful.
You might want to look at Flow-Based Programming - it is based on data chunks streaming between asynchronous components. There are Java and C# versions of the driver, plus a number of precoded components. It is intrinsically multithreaded - in fact the only single-threaded code is within the components, although you can add timing constraints to the standard scheduling rules. Although it may be at too fine-grained a level for what you need, there may be stuff here you can use.
Take a look at boost::future (but see also this discussion and proposal) which looks like a really nice foundation for parallelism (in particular it seems to offer excellent support for C-depends-on-A-and-B type situations).
I looked at OpenMP a bit but (like you) wasn't convinced it would work well for anything but Fortran/C numeric code. Intel's Threading Building Blocks looked more interesting to me.
If it comes to it, it's not too hard to roll your own on top of boost::thread.
[Explanation: a thread farm (most people would call it a pool) draws work from a thread-safe queue of functors (tasks or jobs). See the tests and benchmark for examples of use. I have some extra complication to (optionally) support tasks with priorities, and the case where executing tasks can spawn more tasks into the work queue (this makes knowing when all the work is actually completed a bit more problematic; the references to "pending" are the ones which can deal with the case). Might give you some ideas anyway.]
You may like to look at Intel Thread Building Blocks. I beleave it does what you want and with version 2 it's Open Source.
There's plenty of distributed resource managers out there. The software that meets nearly all of your requirements is Sun Grid Engine. SGE is used on some of the worlds largest supercomputers and is in active development.
There's also similar solutions in Torque, Platform LSF, and Condor.
It sounds like you may want to roll your own but there's plenty of functionality in all of the above.
I don't know if you're looking for a C++ library (which I think you are), but Doug Lea's Fork/Join framework for Java 7 is pretty nifty, and does exactly what you want. You'd probably be able to implement it in C++ or find a pre-implemented library.
More info here:
http://artisans-serverintellect-com.si-eioswww6.com/default.asp?W1
A little late to the punch perhaps, but take a look also at ThreadWeaver:
http://en.wikipedia.org/wiki/ThreadWeaver