Event-driven simulation class


I am working through some of the exercises in The C++ Programming Language by Bjarne Stroustrup. I am confused by problem 11 at the end of Chapter 12:
(*5) Design and implement a library for writing event-driven simulations. Hint: <task.h>. ... An object of class task should be able to save its state and to have that state restored so that it can operate as a coroutine. Specific tasks can be defined as objects of classes derived from task. The program to be executed by a task might be defined as a virtual function. ... There should be a scheduler implementing a concept of virtual time. ... The tasks will need to communicate. Design a class queue for that. ...
I am not sure exactly what this is asking for. Is a task a separate thread? (As far as I know it is not possible to create a new thread without system calls, and since this is a book about C++ I do not believe that is the intent.) Without interrupts, how is it possible to start and stop a running function? I assume this would involve busy waiting (which is to say, continually loop and check a condition) although I cannot see how that could be applied to a function that might not terminate for some time (if it contains an infinite loop, for example).
EDIT: Please see my post below with more information.

Here's my understanding of an "event-driven simulation":
A controller handles an event queue, scheduling events to occur at certain times, then executing the top event on the queue.
Events occur instantaneously at the scheduled time. For example, a "move" event would update the position and state of an entity in the simulation such that the state vector is valid at the current simulation time. A "sense" event would have to make sure all entities' states are at the current time, then use some mathematical model to evaluate how well the current entity can sense the other entities. (Think robots moving around on a board.)
Thus time progresses discontinuously, jumping from event to event. Contrast this with a time-driven simulation, where time moves in discrete steps and all entities' states are updated every time step (a la most Simulink models).
Events can then occur at their natural rate. It usually doesn't make sense to recompute all data at the finest rate in the simulation.
Most production event-driven simulations run in a single thread. They can be complex by their very nature, so trying to synchronize a multi-threaded simulation tends to add exponential layers of complexity. With that said, there's a standard for multi-process military simulations called Distributed Interactive Simulation (DIS) that uses predefined network messages (protocol data units) to transmit data between processes.
EDIT: It's important to define a difference between modeling and simulation. A model is a mathematical representation of a system or process. A simulation is built from one or more models that are executed over a period of time. Again, an event driven simulation hops from event to event, while a time driven simulation proceeds at a constant time step.

Hint: <task.h>.
is a reference to an old cooperative multi-tasking library that shipped with early versions of CFront (you can also download it at that page).
If you read the paper "A Set of C++ Classes for Co-routine Style Programming" things will make a lot more sense.
Adding a bit:
I'm not an old enough programmer to have used the task library. However, I know that C++ was designed after Stroustrup wrote a simulation in Simula that had many of the same properties as the task library, so I've always been curious about it.
If I were to implement the exercise from the book, I would probably do it like this (please note, I haven't tested this code or even tried to compile it):
#include <list>

// Declared first so that Scheduler can refer to it.
struct ITask {
    virtual ~ITask() { }
    virtual void run() = 0; // do one slice of work, then return
};

class Scheduler {
    std::list<ITask*> tasks; // pointers only; the scheduler does not own the tasks
public:
    void run()
    {
        while (true) // or at least until some message is sent to stop running
            for (std::list<ITask*>::iterator itor = tasks.begin(), end = tasks.end();
                 itor != end;
                 ++itor)
                (*itor)->run(); // yes, two dereferences
    }
    void add_task(ITask* task)
    {
        tasks.push_back(task);
    }
};
I know people will disagree with some of my choices. For instance, using a struct for the interface; but structs have the behavior that inheriting from them is public by default (where inheriting from classes is private by default), and I don't see any value in inheriting privately from an interface, so why not make public inheritance the default?
The idea is that calls to ITask::run() will block the scheduler until the task arrives at a point where it can be interrupted, at which point the task will return from the run method, and wait until the scheduler calls run again to continue. The "cooperative" in "cooperative multitasking" means "tasks say when they can be interrupted" ("coroutine" usually means "cooperative multitasking"). A simple task may only do one thing in its run() method, a more complex task may implement a state machine, and may use its run() method to figure out what state the object is currently in and make calls to other methods based on that state. The tasks must relinquish control once in a while for this to work, because that is the definition of "cooperative multitasking." It's also the reason why modern general-purpose operating systems use preemptive rather than cooperative multitasking: a single task that never yields stalls everything else.
This implementation does not (1) follow fair scheduling (maybe keeping a running total of clock ticks spent in each task's run() method, and skipping tasks that have used too much time relative to the others until the other tasks "catch up"), (2) allow for tasks to be removed, or even (3) allow for the scheduler to be stopped.
As for communicating between tasks, you may consider looking at Plan 9's libtask or Rob Pike's newsqueak for inspiration (the "UNIX implementation of Newsqueak" download includes a paper, "The Implementation of Newsqueak" that discusses message passing in an interesting virtual machine).
But I believe this is the basic skeleton Stroustrup had in mind.
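To make the "state machine in run()" idea a bit more concrete, here is a rough sketch of how it might look. Everything in it is my own invention for illustration (the names Task, step(), Countdown and RoundRobin are hypothetical, not from <task.h>), and it assumes a variation where the task tells the scheduler whether it wants to be called again, which also gives a way to remove tasks and stop the loop:

```cpp
#include <cstddef>
#include <list>

// Hypothetical variation on ITask: step() reports whether the task
// wants to be scheduled again.
struct Task {
    virtual ~Task() { }
    virtual bool step() = 0; // do one slice of work; return false when finished
};

// A task written as an explicit state machine. Its "saved state" is simply
// the member variables it keeps between calls to step().
class Countdown : public Task {
    enum State { Starting, Counting, Done };
    State state_;
    int remaining_;
public:
    explicit Countdown(int n) : state_(Starting), remaining_(n) { }
    int remaining() const { return remaining_; }
    bool step() {
        switch (state_) {
        case Starting:
            state_ = (remaining_ > 0) ? Counting : Done;
            return state_ != Done;
        case Counting:
            if (--remaining_ == 0) state_ = Done;
            return state_ != Done;
        case Done:
            break;
        }
        return false;
    }
};

class RoundRobin {
    std::list<Task*> tasks_; // not owned
public:
    void add(Task* t) { tasks_.push_back(t); }
    // Runs until every task has finished; returns the number of step() calls.
    std::size_t run() {
        std::size_t steps = 0;
        while (!tasks_.empty()) {
            for (std::list<Task*>::iterator it = tasks_.begin(); it != tasks_.end(); ) {
                ++steps;
                if ((*it)->step()) ++it;
                else it = tasks_.erase(it); // finished tasks drop out of the rotation
            }
        }
        return steps;
    }
};
```

Each call to step() returns quickly, so the two counters interleave even though there is only one thread. This still says nothing about virtual time or fairness.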

Sounds to me like the exercise is asking you to implement a cooperative multi-tasking scheduler. The scheduler operates in virtual time (time ticks you define/implement at whatever level you want), chooses a task to run based on the queue (note that the description mentioned you'd need to implement one), and when the current task is done, the scheduler selects the next one and starts it running.

The generalised structure of a discrete event simulation is based on a priority queue keyed on a time value. At a broad level it goes like:
While (not end condition):
Pop next event (one with the lowest time) from the priority queue
Process that event, which may generate more events
If a new event is generated:
Place this on the priority queue keyed at its generated time
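The loop above could be sketched in C++ roughly as follows. This is only a minimal illustration under my own assumptions: the names Event, Simulation, schedule() and run() are made up for the example, and events are plain callbacks rather than anything richer.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

class Simulation; // forward declaration so Event can mention it

// Hypothetical event type: a virtual time plus an action to perform.
struct Event {
    double time;
    std::function<void(Simulation&)> action; // may schedule further events
};

// Comparator that puts the event with the lowest time on top of the queue.
struct LaterFirst {
    bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
};

class Simulation {
    std::priority_queue<Event, std::vector<Event>, LaterFirst> queue_;
    double now_ = 0.0;
public:
    double now() const { return now_; }

    void schedule(double time, std::function<void(Simulation&)> action) {
        assert(time >= now_); // events may not be scheduled in the past
        queue_.push(Event{time, std::move(action)});
    }

    // Runs until the queue is empty or the next event lies beyond end_time.
    void run(double end_time) {
        while (!queue_.empty() && queue_.top().time <= end_time) {
            Event ev = queue_.top(); // copy out before popping
            queue_.pop();
            now_ = ev.time;          // virtual time jumps straight to the event
            ev.action(*this);        // processing may push new events
        }
    }
};
```

Note that virtual time never ticks; it just jumps to the timestamp of whichever event is popped next.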
Co-routines change the view of the model from being event-centric to being entity-centric. Entities can go through some life-cycle (e.g. accept job, grab resource X, process job, release resource X, place job in queue for next step). This is somewhat easier to program because grabbing resources is handled with semaphore-like synchronisation primitives. The jobs and synchronisation primitives generate the events and queue them behind the scenes.
This gives a model conceptually similar to processes in an operating system and a scheduler waking the process up when its input or a shared resource it has requested is available. The co-routine model makes the simulation quite a lot easier to understand, which is useful for simulating complex systems.

(I'm not a C++ dev)
Probably what it means is that you need to create a class Task (as in Event) that will consist mostly of a callback function pointer and a scheduled time, and can be stored in a list in the Scheduler class, which in turn basically should keep track of a time counter and call each Task's function when the time arrives. These tasks should be created by the Objects of the simulation.
If you need help on the discrete simulation side, go ahead and edit the question.

This is in response to titaniumdecoy's comment to SottieT812's answer. It's much too large for a comment, so I decided to make it another answer.
It is event driven in the sense that simulation state only changes in response to an event. For example, assume that you have two events missile launch and missile impact. When the launch event is executed, it figures out when and where it will impact, and schedules an impact event for the appropriate time. The position of the missile is not calculated between the launch and impact, although it will probably have a method that can be called by other objects to get the position at a particular time.
This is in contrast to a time driven simulation, where the exact position of the missile (and every other object in the simulation) is calculated after every time step, say 1 second.
Depending on the characteristics of the model, the fidelity of the answer required, and many other factors, either event driven or time driven simulation may perform better.
Edit: If anyone is interested in learning more, check out the papers from the Winter Simulation Conference

There is a book and framework called DEMOS (Discrete Event Modelling on Simula) that describes a co-routine based framework (the eponymous DEMOS). Despite being 30 years or so old DEMOS is actually quite a nice system, and Graham Birtwistle is a really nice guy.
If you implement co-routines in C++ (think setjmp/longjmp) you should take a look at this book for a description of a really, really elegant discrete event modelling framework. Although it's 30 years old it's a bit of a timeless classic and still has a fan base.

In the paper linked to by "me.yahoo.com/..." which describes the task.h class:
Tasks execute in parallel
A task may be suspended and resumed later
The library is described as a method of multiprogramming.
Is it possible to do this without using threads or separate processes?

Related

What is the executor pattern in a C++ context?

The author of asio, Christopher Kohlhoff, is working on a library and proposal for executors in C++. His work so far includes this repo and docs. Unfortunately, the rationale portion has yet to be written. So far, the docs give a few examples of what the library does, but I feel like I'm missing something. Surely this is more than a family of fancy invoker functions.
Everything I can find on Google is very Java specific and a lot of it is particular to specific frameworks so I'm having trouble figuring out what this "executor pattern" is all about.
What are executors in this context? What do they do? What are the canonical examples of when they would be helpful? What variations exist among executors? What are the alternatives to executors and how do they compare? In particular, there seems to be a lot of overlap with an event loop where the events are initial input events, execution events, and a shutdown event.
When trying to figure out new abstractions I usually find understanding the motivation key. So for executors, what are we trying to abstract and why? What are we trying to make generic? Without executors, what extra work would we have to do?

The most basic benefit of executors is separating the definition of a program's parallelism from how it's used. Java's executor model exists because, by and large, you don't actually know, when you're first writing code, what parallelism model is best for your scenario. You might have little to gain from parallelism and shouldn't use threads at all, you might do best with a long running dedicated worker thread for each core, or a dynamically scaling pool of threads based on current load that cleans up threads after they've been idle a while to reduce memory usage, context switches, etc., or maybe just launching a thread for every task on demand, exiting when the task is done.
The key here is it's nigh impossible to know which approach is best when you're first writing code. You may know where parallelism might help you, but in traditional threading, you end up intermingling the parallelism "configuration" (when and whether to create threads) with the use of parallelism (determining which functions to call with what arguments). When you do mix the code like this, it's a royal pain to do performance testing of different options, because each and every thread launch is independent, and must be updated separately.
The main benefit of the executor model is that the parallelism configuration is done in one place (where the executor is created), and the users of that executor don't have to know anything about it. They just submit work to the executor, receive a future, and at some later point, retrieve the result (blocking if necessary) from the future. If you want to experiment with other configurations, you change the one line defining the executor and run your code again. Even if you decide you need to use different parallelism models for different sections of your code, refactoring to add a second executor and change some of the users of the first executor to use the second is easy compared to manually rewriting the threading details of every site; as long as the executor's name is (relatively) unique, finding users and changing them to use a different one is pretty easy. Executors both simplify your code (by avoiding intermingling thread creation/management with the tasks the threads do) and simplify performance testing.
As a side-benefit, you also abstract away the complexities of transferring data into and out of a worker thread (the submit method encapsulates the former, the future's result method encapsulates the latter). std::async gets you some of this benefit, but with no real control over the parallelism involved (just a yes/no/maybe choice of whether to force a thread, force deferred execution in the current thread, or let the compiler/library decide, with no fine grained control over whether a thread pool is used, and if so, how it behaves). A true executor framework gives you the control std::async fails to provide, with similar ease of use.
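As a rough illustration of "submit work, receive a future, change the policy in one place", here is a minimal sketch. The Executor interface, the two policies and the submit() helper are all hypothetical names of my own; this is not the API of Kohlhoff's proposal, of Java, or of any existing library.

```cpp
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical interface: the only thing callers ever see.
class Executor {
public:
    virtual ~Executor() = default;
    virtual void execute(std::function<void()> work) = 0;
};

// Policy 1: run the work immediately on the calling thread (no parallelism).
class InlineExecutor : public Executor {
public:
    void execute(std::function<void()> work) override { work(); }
};

// Policy 2: one new thread per piece of work, joined on destruction.
// In this sketch execute() itself must not be called concurrently.
class ThreadPerTaskExecutor : public Executor {
    std::vector<std::thread> threads_;
public:
    void execute(std::function<void()> work) override {
        threads_.emplace_back(std::move(work));
    }
    ~ThreadPerTaskExecutor() override {
        for (std::thread& t : threads_) t.join();
    }
};

// Wraps a callable so the caller gets a future, whichever policy is behind ex.
template <typename F>
auto submit(Executor& ex, F f) -> std::future<decltype(f())> {
    using R = decltype(f());
    auto task = std::make_shared<std::packaged_task<R()>>(std::move(f));
    std::future<R> result = task->get_future();
    ex.execute([task] { (*task)(); });
    return result;
}
```

Code that calls submit(ex, ...) does not change when ex is switched from InlineExecutor to ThreadPerTaskExecutor (or to a thread pool, which is omitted here); only the line that constructs the executor does.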

Threading vs Task-Based vs Asynchronous Programming

I'm new to this concept. Are these the same or different things? What is the difference? I really like the idea of being able to run two processes at once, for example if I have several large files to load into my program I'd love to load as many of them simultaneously as possible instead of waiting for one at a time. And when working with a large file, such as wav file, it would be great to break it into pieces and do processing on several chunks at once and then put them back together. What do I want to look into to learn how to do this sort of thing?
Edit: Also, I know using more than one core on a multicore processor fits in here somewhere, but apparently asynchronous programming doesn't necessarily mean you are using multiple cores? Why would you do this if you didn't have multiple cores to take advantage of?

They are related but different.
Threading, normally called multi-threading, refers to the use of multiple threads of execution within a single process. This usually refers to the simple case of using a small set of threads each doing different tasks that need to be, or could benefit from, running simultaneously. For example, a GUI application might have one thread draw elements, another thread respond to events like mouse clicks, and another thread do some background processing.
However, when the number of threads, each doing their own thing, is taken to an extreme, we usually start to talk about an Agent-based approach.
The task-based approach refers to a specific strategy in software engineering where, in abstract terms, you dynamically create "tasks" to be accomplished, and these tasks are picked up by a task manager that assigns the tasks to threads that can accomplish them. This is more of a software architectural thing. The advantage here is that the execution of the whole program is a succession of tasks being relayed (task A finished -> trigger task B, when both task B and task C are done -> trigger task D, etc.), instead of having to write a big function or program that executes each task one after the other. This gives flexibility when it is unclear which tasks will take more time than others, and when tasks are only loosely coupled. This is usually implemented with a thread-pool (threads that are waiting to be assigned a task) and some message-passing mechanism to communicate data and task "contracts".
Asynchronous programming does not refer to multi-threaded programming, although the two are very often associated (and work well together). A synchronous program must complete each step before moving on to the next. An asynchronous program starts a step, moves on to other steps that don't require the result of the first step, then checks on the result of the first step when its result is required.
That is, a synchronous program might go a little bit like this: "do this task", "wait until done", "do something with the result", and "move on to something else". By contrast, an asynchronous program might go a little more like this: "I'm gonna start a task, and I'll need the result later, but I don't need it just now", "in the meantime, I'll do something else", "I can't do anything else until I have the result of the first step now, so I'll wait for it, if it isn't ready", and "move on to something else".
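In C++ the asynchronous sequence just described might look something like the sketch below, using std::async and std::future from the standard library. The function slow_sum is a made-up stand-in for whatever slow operation you have; note that std::launch::async does use a second thread here, which is one way to get asynchrony but not the only one.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a slow operation (imagine file or network I/O).
long slow_sum(std::vector<int> values) {
    return std::accumulate(values.begin(), values.end(), 0L);
}

long async_demo() {
    std::vector<int> data(1000, 1);

    // "I'm gonna start a task, and I'll need the result later"
    std::future<long> pending = std::async(std::launch::async, slow_sum, data);

    // "in the meantime, I'll do something else"
    long other_work = 0;
    for (int i = 1; i <= 10; ++i) other_work += i;

    // "I can't do anything else until I have the result": get() blocks if needed
    long sum = pending.get();

    return sum + other_work;
}
```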
Notice that "asynchronous" refers to a very broad concept, that always involves some form of "start some work and tell me when it's done" instead of the traditional "do it now!". This does not require multi-threading, in which case it just becomes a software design choice (which often involves callback functions and things like that to provide "notification" of the asynchronous result). With multiple threads, it becomes more powerful, as you can do various things in parallel while the asynchronous task is working. Taken to the extreme, it can become a more full-blown architecture like a task-based approach (which is one kind of asynchronous programming technique).
I think the thing that you want corresponds more to yet another concept: Parallel Computing (or parallel processing). This approach is more about splitting a large processing task into smaller parts and processing all parts in parallel, and then combining the results. You should look into libraries like OpenMP or OpenCL/CUDA (for GPGPU). That said, you can use multi-threading for parallel processing.
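For the "break it into pieces, process the chunks at once, put them back together" case, a minimal sketch with plain standard-library threads (via std::async) might look like this. The function name parallel_sum and the choice of summing as the per-chunk work are just assumptions for the example; real audio processing would replace the std::accumulate call.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Splits samples into roughly equal chunks, sums each chunk on its own
// thread, then combines the partial results.
long long parallel_sum(const std::vector<int>& samples, std::size_t chunks) {
    if (chunks == 0) chunks = 1;
    const std::size_t n = samples.size();
    const std::size_t chunk_size = (n + chunks - 1) / chunks; // ceiling division

    std::vector<std::future<long long>> partials;
    for (std::size_t begin = 0; begin < n; begin += chunk_size) {
        const std::size_t end = std::min(begin + chunk_size, n);
        partials.push_back(std::async(std::launch::async, [&samples, begin, end] {
            const auto first = samples.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = samples.begin() + static_cast<std::ptrdiff_t>(end);
            return std::accumulate(first, last, 0LL);
        }));
    }

    long long total = 0;
    for (std::future<long long>& p : partials) total += p.get(); // combine step
    return total;
}
```

This only pays off when the per-chunk work is heavy enough to outweigh the cost of starting threads.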
but apparently asynchronous programming doesn't necessarily mean you are using multiple cores?
Asynchronous programming does not necessarily involve anything happening concurrently in multiple threads. It could mean that the OS is doing things on your behalf behind the scenes (and will notify you when that work is finished), like in asynchronous I/O, which happens without you creating any threads. It boils down to being a software design choice.
Why would you do this if you didn't have multiple cores to take advantage of?
If you don't have multiple cores, multi-threading can still improve performance by reusing "waiting time" (e.g., don't "block" the processing waiting on file or network I/O, or waiting on the user to click a mouse button). That means the program can do useful work while waiting on those things. Beyond that, it can provide flexibility in the design and make things seem to run simultaneously, which often makes users happier. Still, you are correct that before multi-core CPUs, there wasn't as much of an incentive to do multi-threading, as the gains often do not justify the overhead.

I think in general, all of these are design related rather than language related. The same applies to multicore programming.
To echo Jim, it's not only the file load scenario. Generally, you need to design the whole software to run concurrently in order to feel the real benefit of multi-threading, task-based, or asynchronous programming.
Try to see things from a big-picture point of view. Understand the overall modelling of a specific example and see how these methodologies are implemented. It'll be easy to see the difference, and it will help you understand when and where to use which.

Periodically call a C function without manually creating a thread

I have implemented a WebSocket handler in C++ and I need to send ping messages once in a while. However, I don't want to start one thread per socket/one global poll thread which only calls the ping function but instead use some OS functionality to call my timer function. On Windows, there is SetTimer but that requires a working message loop (which I don't have.) On Linux there is timer_create, which looks better.
Is there some portable, low-overhead method to get a function called periodically, ideally with some custom context? I.e. something like settimer (const int millisecond, const void* context, void (*callback)(const void*))?
[Edit] Just to make this a bit clearer: I don't want to have to manage additional threads. On Windows, I guess using CreateThreadpoolTimer on the system thread pool will do the trick, but I'm curious to hear if there is a simpler solution and how to port this over to Linux.

If you are intending to go cross-platform, I would suggest you use a cross platform event library like libevent.
libev is newer, but it currently has weak Win32 support.

If you use sockets, you can use select() with a timeout to wait for socket events, and in that loop compute the elapsed time and call the callback when it is due.

If you are looking for a timer that will not require an additional thread, let you do your work transparently and then call the timer function at the appropriate time in the same thread by pre-emptively interrupting your application, then there is no such portable thing.
The first reason is that it's downright dangerous. That's like writing a multi-threaded application with absolutely no synchronization. The second reason is that it is extremely difficult to have good semantics in multi-threaded applications. Which thread should execute the timer callback?
If you're writing a web-socket handler, you are probably already writing a select()-based loop. If so, then you can just use select() with a short timeout and check the different connections for which you need to ping each peer.
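A select()-based loop with a ping timer might be sketched like this. It is POSIX-only, handles a single descriptor to stay short, and every name in it (run_ping_loop, on_readable, send_ping, the run_for cut-off) is hypothetical and only there to make the example self-contained.

```cpp
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

#include <chrono>
#include <functional>

// Waits on one descriptor with select(), using the timeout to wake up for
// periodic pings. Returns the number of pings sent before run_for elapsed.
int run_ping_loop(int fd,
                  std::chrono::milliseconds ping_interval,
                  std::chrono::milliseconds run_for,
                  const std::function<void(int)>& on_readable,
                  const std::function<void()>& send_ping) {
    using clock = std::chrono::steady_clock;
    const clock::time_point stop_at = clock::now() + run_for;
    clock::time_point next_ping = clock::now() + ping_interval;
    int pings = 0;

    while (clock::now() < stop_at) {
        // Sleep at most until the next ping is due.
        auto wait = std::chrono::duration_cast<std::chrono::microseconds>(
            next_ping - clock::now());
        if (wait.count() < 0) wait = std::chrono::microseconds(0);

        timeval tv;
        tv.tv_sec = static_cast<time_t>(wait.count() / 1000000);
        tv.tv_usec = static_cast<suseconds_t>(wait.count() % 1000000);

        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(fd, &readable);

        const int ready = select(fd + 1, &readable, nullptr, nullptr, &tv);
        if (ready < 0) break; // error; a real loop would inspect errno
        if (ready > 0 && FD_ISSET(fd, &readable)) on_readable(fd);

        if (clock::now() >= next_ping) {
            send_ping();
            ++pings;
            next_ping += ping_interval;
        }
    }
    return pings;
}
```

The same thread handles both socket input and the timer, so the ping callback never runs concurrently with the read handler and no locking is needed.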

Whenever you have asynchronous events, you should have an event loop. This doesn't need to be some system default one, like Windows' message loop. You can create your own. But you should be using it.
The whole point about event-based programming is that you are decoupling your code handling to deal with well-defined functional fragments based on these asynchronous events. Without an event loop, you are condemning yourself to interleaving code that gets input and produces output based on poorly defined "states" that are just fragments of procedural code.
Without a well-defined separation of states using an event-based design, code quickly becomes unmanageable. Because code pauses inside procedures to do input tasks, you have lifetimes of objects that will not span entire procedure scopes, and you will begin to write if (nullptr == xx) in various places that access objects created or destroyed based on events. Dispatch becomes combinatorially complex because you have different events expected at each input point and no abstraction.
However, simply using an event loop and dispatch to state machines, you've decreased handling complexity to basic management of handlers (O(n) handlers versus O(mn) branch statements with n types of events and m states). You decouple handling but still allow for functionality to change depending on state. But now these states are well-defined using state classes. And new states can be added if the requirements of the product change.
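To show the shape of "event loop plus dispatch to state classes", here is a deliberately tiny sketch. The events, the two states and the function run_event_loop are invented for the example; a real loop would block waiting for input instead of draining a pre-filled queue.

```cpp
#include <queue>
#include <string>
#include <vector>

enum class Event { Connect, Data, Close };

// Each state is a class; handling an event returns the state to be in next.
class State {
public:
    virtual ~State() = default;
    virtual const char* name() const = 0;
    virtual State* handle(Event e) = 0;
};

State* idle();
State* connected();

class IdleState : public State {
public:
    const char* name() const override { return "idle"; }
    State* handle(Event e) override {
        if (e == Event::Connect) return connected();
        return this; // Data and Close are ignored while idle
    }
};

class ConnectedState : public State {
public:
    const char* name() const override { return "connected"; }
    State* handle(Event e) override {
        if (e == Event::Close) return idle();
        return this; // Data keeps the connection open
    }
};

State* idle() { static IdleState s; return &s; }
State* connected() { static ConnectedState s; return &s; }

// The loop itself: pop an event, hand it to the current state, repeat.
// Returns the name of the state after each event, for inspection.
std::vector<std::string> run_event_loop(std::queue<Event> events) {
    std::vector<std::string> trace;
    State* current = idle();
    while (!events.empty()) {
        current = current->handle(events.front());
        events.pop();
        trace.push_back(current->name());
    }
    return trace;
}
```

Adding a new state means adding one class, and the loop does not change.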
I'm just saying, stop trying to avoid an event loop. It's a software pattern for very important reasons, all of which have to do with producing professional, reusable, scalable code. Use Boost.ASIO or some other framework for cross platform capabilities. Don't get in the habit of doing it wrong just because you think it will be less of an effort. In the end, even if it's not a professional project that needs maintenance long term, you want to practice making your code professional so you can do something with your skills down the line.

How to keep asynchronous parallel program code manageable (for example in C++)

I am currently working on a server application that needs to control a collection of devices over a network. Because of this, we need to do a lot of parallel programming. Over time, I have learned that there are three approaches to communication between processing entities (threads/processes/applications). Regrettably, all three approaches have their disadvantages.
A) You can make a synchronous request (a synchronous function call). In this case, the caller waits until the function is processed and the response has been received. For example:
const bool convertedSuccessfully = Sync_ConvertMovie(params);
The problem is that the caller is idling. Sometimes this is just not an option. For example, if the call was made by the user interface thread, it will seem like the application has blocked until the response arrives, which can take a long time.
B) You can make an asynchronous request and wait for a callback to be made. The client code can continue with whatever needs to be done.
Async_ConvertMovie(params, TheFunctionToCallWhenTheResponseArrives);
This solution has the big disadvantage that the callback function necessarily runs in a separate thread. The problem is now that it is hard to get the response back to the caller. For example, you have clicked a button in a dialog, which called a service asynchronously, but the dialog has been long closed when the callback arrives.
void TheFunctionToCallWhenTheResponseArrives()
{
//Difficulty 1: how to get to the dialog instance?
//Difficulty 2: how to guarantee in a thread-safe manner that
// the dialog instance is still valid?
}
This in itself is not that big a problem. However, when you want to make more than one of such calls, and they all depend on the response of the previous one, this becomes in my experience unmanageably complex.
C) The last option I see is to make an asynchronous request and keep polling until the response has arrived. In between the has-the-response-arrived-yet checks, you can do something useful. This is the best solution I know of to solve the case in which there is a sequence of asynchronous function calls to make. This is because it has the big advantage that you still have the whole caller context around when the response arrives. Also, the logical sequence of the calls remains reasonably clear. For example:
const CallHandle c1 = Async_ConvertMovie(sourceFile, destFile);
while (!c1.ResponseHasArrived())
{
    //... do something in the meanwhile
}
if (!c1.IsSuccessful())
    return;
const CallHandle c2 = Async_CopyFile(destFile, otherLocation);
while (!c2.ResponseHasArrived())
{
    //... do something in the meanwhile
}
if (c2.IsSuccessful())
{
    //show a success dialog
}
The problem with this third solution is that you cannot return from the caller's function. This makes it unsuitable if the work you want to do in between has nothing to do at all with the work you are getting done asynchronously. I have long wondered whether there is some other way to call functions asynchronously, one that doesn't have the downsides of the options listed above. Does anyone have an idea, some clever trick perhaps?
Note: the example given is C++-like pseudocode. However, I think this question equally applies to C# and Java, and probably a lot of other languages.

You could consider an explicit "event loop" or "message loop", not too different from classic approaches such as a select loop for asynchronous network tasks or a message loop for a windowing system. Events that arrive may be dispatched to a callback when appropriate, such as in your example B, but they may also in some cases be tracked differently, for example to cause transitions in a finite state machine. An FSM is a fine way to manage the complexity of an interaction along a protocol that requires many steps, after all!
One approach to systematizing these considerations starts with the Reactor design pattern.
Schmidt's ACE body of work is a good starting point for these issues, if you come from a C++ background; Twisted is also quite worthwhile, from a Python background; and I'm sure that similar frameworks and sets of whitepapers exist for, as you say, "a lot of other languages" (the Wikipedia URL I gave does point at Reactor implementations for other languages, besides ACE and Twisted).

I tend to go with B, but instead of calling forth and back, I'd do the entire processing including follow-ups on a separate thread. The main thread can meanwhile update the GUI and either actively wait for the thread to complete (i.e. show a dialog with a progress bar), or just let it do its thing in the background and pick up the notification when it's done. No complexity problems so far, since the entire processing is actually synchronous from the processing thread's point of view. From the GUI's point of view, it's asynchronous.
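In standard C++ this idea might be sketched as follows, with std::async playing the role of the processing thread. The functions convert_movie and copy_file are hypothetical blocking stand-ins for the calls in the question, the file names are made up, and the polling loop stands in for whatever the GUI framework's own event loop would do.

```cpp
#include <chrono>
#include <future>
#include <string>

// Hypothetical blocking operations; real ones would do actual work.
bool convert_movie(const std::string& source, const std::string& dest) {
    return !source.empty() && !dest.empty();
}
bool copy_file(const std::string& from, const std::string& to) {
    return !from.empty() && !to.empty();
}

// The whole multi-step job, written as plain sequential code.
// From the worker thread's point of view everything is synchronous.
bool convert_and_copy(std::string source, std::string dest, std::string other) {
    if (!convert_movie(source, dest)) return false;
    return copy_file(dest, other);
}

// What the GUI thread might do: start the job, stay responsive, then
// pick up the result once it is ready.
bool run_job_from_gui_thread(int& ui_updates) {
    std::future<bool> job = std::async(std::launch::async, convert_and_copy,
                                       std::string("in.avi"),
                                       std::string("out.mp4"),
                                       std::string("backup/out.mp4"));
    while (job.wait_for(std::chrono::milliseconds(10)) != std::future_status::ready) {
        ++ui_updates; // e.g. repaint a progress bar
    }
    return job.get();
}
```

The dependent calls stay in their natural order inside convert_and_copy, so there is no chain of callbacks to keep track of.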
Adding to that, in .NET it's no problem to switch to the GUI thread. The BackgroundWorker class and the ThreadPool make this easy as well (I used the ThreadPool, if I remember correctly). In Qt, for example, to stay with C++, it's quite easy as well.
I used this approach on our last major application and am very pleased with it.

Like Alex said, look at Proactor and Reactor as documented by Doug Schmidt in Pattern-Oriented Software Architecture.
There are concrete implementations of these for different platforms in ACE.

Multithreaded job queue manager

I need to manage CPU-heavy multitaskable jobs in an interactive application. Just as background, my specific application is an engineering design interface. As a user tweaks different parameters and options to a model, multiple simulations are run in the background and results displayed as they complete, likely even as the user is still editing values. Since the multiple simulations take variable time (some are milliseconds, some take 5 seconds, some take 10 minutes), it's basically a matter of getting feedback displayed as fast as possible, but often aborting jobs that started previously but are now no longer needed because the user's changes have already invalidated them. Different user changes may invalidate different computations, so at any time I may have 10 different simulations running. Some simulations have multiple parts which have dependencies (simulations A and B can be separately computed, but I need their results to seed simulation C, so I need to wait for both A and B to finish first before starting C.)
I feel pretty confident that the code-level method to handle this kind of application is some kind of multithreaded job queue. This would include features of submitting jobs for execution, setting task priorities, waiting for jobs to finish, specifying dependencies (do this job, but only after job X and job Y have finished), canceling subsets of jobs that fit some criteria, querying what jobs remain, setting worker thread counts and priorities, and so on. And multiplatform support is very useful too.
These are not new ideas or desires in software, but I'm at the early design phase of my application where I need to make a choice about what library to use for managing such tasks. I've written my own crude thread managers in the past in C (I think it's a rite of passage) but I want to use modern tools to base my work on, not my own previous hacks.
The first thought is to run to OpenMP but I'm not sure it's what I want. OpenMP is great for parallelizing at a fine level, automatically unrolling loops and such. While multiplatform, it also invades your code with #pragmas. But mostly it's not designed for managing large tasks, especially cancelling pending jobs or specifying dependencies. Possible, yes, but it's not elegant.
I noticed that Google Chrome uses such a job manager for even the most trivial tasks. The design goal seems to be to keep the user interaction thread as light and nimble as possible, so anything that can get spawned off asynchronously, should be. From looking at the Chrome source this doesn't seem to be a generic library, but it still is interesting to see how the design uses asynchronous launches to keep interaction fast. This is getting to be similar to what I'm doing.
There are still other options:
Surge.Act: a Boost-like library for defining jobs. It builds on OpenMP, but does allow chaining of dependencies which is nice. It doesn't seem to feel like it's got a manager that can be queried, jobs cancelled, etc. It's a stale project so it's scary to depend on it.
Job Queue is quite close to what I'm thinking of, but it's a 5 year old article, not a supported library.
Boost.threads does have nice platform independent synchronization but that's not a job manager. POCO has very clean designs for task launching, but again not a full manager for chaining tasks. (Maybe I'm underestimating POCO though).
So while there are options available, I'm not satisfied and I feel the urge to roll my own library again. But I'd rather use something that's already in existence. Even after searching (here on SO and on the net) I haven't found anything that feels right, though I imagine this must be a kind of tool that is often needed, so surely there's some community library or at least common design.
On SO there's been some posts about job queues, but nothing that seems to fit.
My post here is to ask you all what existing tools I've missed, and/or how you've rolled your own such multithreaded job queue.

We had to build our own job queue system to meet requirements similar to yours (the UI thread must always respond within 33ms, jobs can run from 15 to 15000ms), because there really was nothing out there that quite met our needs, let alone was performant.
Unfortunately our code is about as proprietary as proprietary gets, but I can give you some of the most salient features:
We start up one thread per core at the beginning of the program. Each pulls work from a global job queue. Jobs consist of a function object and a glob of associated data (really an elaboration on a func_ptr and void *). Thread 0, the fast client loop, isn't allowed to work on jobs, but the rest grab as they can.
The job queue itself ought to be a lockless data structure, such as a lock-free singly linked list (Visual Studio comes with one). Avoid using a mutex; contention for the queue is surprisingly high, and grabbing mutexes is costly.
Pack up all the necessary data for the job into the job object itself -- avoid having pointers from the job back into the main heap, where you'll have to deal with contention between jobs and locks and all that other slow, annoying stuff. For example, all the simulation parameters should go into the job's local data blob. The results structure obviously needs to be something that outlives the job: you can deal with this either by a) hanging onto the job objects even after they've finished running (so you can use their contents from the main thread), or b) allocating a results structure specially for each job and stuffing a pointer into the job's data object. Even though the results themselves won't live in the job, this effectively gives the job exclusive access to its output memory so you needn't mess with locks.
Actually I'm simplifying a bit above, since we need to choreograph exactly which jobs run on which cores, so each core gets its own job queue, but that's probably unnecessary for you.
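As a rough illustration of the layout described above (worker threads pulling from one global queue, each job carrying its own inputs plus a private result slot), here is a minimal sketch. It is not the proprietary code: SimParams, SimResult, Job, JobPool and run_pool_demo are hypothetical names, and for brevity the queue is guarded by a mutex and condition variable rather than the lock-free list recommended above.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical input and output types for one simulation job.
struct SimParams { double start; int steps; };
struct SimResult { double value = 0.0; bool done = false; };

// A job owns a copy of its inputs and a pointer to its own result slot.
struct Job {
    SimParams params;                   // copied in, never shared
    std::shared_ptr<SimResult> result;  // written by exactly one worker
};

class JobPool {
public:
    explicit JobPool(std::size_t workers) {
        for (std::size_t i = 0; i < workers; ++i)
            threads_.emplace_back([this] { work(); });
    }
    // Lets the workers drain the queue, then joins them.
    ~JobPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_all();
        for (auto& t : threads_) t.join();
    }
    // Called from the client thread, which never runs jobs itself.
    std::shared_ptr<SimResult> submit(SimParams p) {
        auto result = std::make_shared<SimResult>();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(Job{p, result});
        }
        ready_.notify_one();
        return result;
    }
private:
    void work() {
        for (;;) {
            Job job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            // Stand-in for the real simulation; touches only the job's own data.
            double v = job.params.start;
            for (int i = 0; i < job.params.steps; ++i) v += 1.0;
            job.result->value = v;
            job.result->done = true;
        }
    }
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<Job> jobs_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};

// Submits eight jobs and reads the results after the pool has shut down.
double run_pool_demo() {
    std::vector<std::shared_ptr<SimResult>> results;
    {
        JobPool pool(3);
        for (int i = 0; i < 8; ++i)
            results.push_back(pool.submit(SimParams{double(i), 100}));
    }  // the destructor waits for every queued job
    double sum = 0.0;
    for (auto& r : results) sum += r->value;
    return sum;
}
```

Results are read only after the workers have been joined, so the result slots themselves need no locking. A real UI thread would instead poll a per-job completion signal (for example an atomic flag) while the pool keeps running.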

I rolled my own, based on Boost.threads. I was quite surprised by how much bang I got from writing so little code. If you don't find something pre-made, don't be afraid to roll your own. Between Boost.threads and your experience since writing your own, it might be easier than you remember.
For premade options, don't forget that Chromium is licensed very friendly, so you may be able to roll your own generic library around its code.

Microsoft is working on a set of technologies for the next version of Visual Studio (2010) called the Concurrency Runtime, the Parallel Pattern Library and the Asynchronous Agents Library, which will probably help. The Concurrency Runtime will offer policy-based scheduling, i.e. allowing you to manage and compose multiple scheduler instances (similar to thread pools but with affinitization and load balancing between instances). The Parallel Pattern Library will offer task-based programming and parallel loops with an STL-like programming model. The Agents library offers an actor-based programming model and has support for building concurrent data flow pipelines, i.e. managing those dependencies described above. Unfortunately this isn't released yet, but you can read about it on our team blog or watch some of the videos on Channel 9; there is also a very large CTP available for download.
If you're looking for a solution today, Intel's Threading Building Blocks and Boost's threading library are both good libraries and available now. JustSoftwareSolutions has released an implementation of std::thread which matches the C++0x draft, and of course OpenMP is widely available if you're looking at fine-grained loop-based parallelism.
The real challenge as other folks have alluded to is to correctly identify and decompose work into tasks suitable for concurrent execution (i.e. no unprotected shared state), understand the dependencies between them and minimize the contention that can occur on bottlenecks (whether the bottleneck is protecting shared state or ensuring the dispatch loop of a work queue is low contention or lock-free)... and to do this without scheduling implementation details leaking into the rest of your code.
-Rick

Would something like threadpool be useful to you? It's based on boost::threads and basically implements a simple thread task queue that passes worker functions off to the pooled threads.

I've been looking for nearly the same requirements. I'm working on a game with 4x-ish mechanics, and scheduling different parts of what gets done almost exploded my brain. I have a complex set of work that needs to get accomplished at different time resolutions, and to a different degree of actual simulation depending on what system/region the player has actively loaded. This means as the player moves from system to system, I need to load a system to the current high resolution simulation, offload the last system to a lower resolution simulation, and do the same for active/inactive regions of systems. The different simulations are big lists of population, political, military, and economic actions based on profiles of each entity. I'm going to try to describe my issue and my approach so far, and I hope it's useful in describing an alternative for you or someone else. The rough outline of the structure I'm building will use the following:
cpp-taskflow (A Modern C++ Parallel Task Programming Library): I'm going to make a library of modules that will be used as job construction parts. Each entry will have an API for initializing and destruction as well as pointers for communication. I'm hoping to write it in a way that they will be nestable, using the cpp-taskflow API to set up all the dependencies at job creation time, but provide a means of live adjustment and having a kill-switch available. Most of what I'm making will be decision trees of state machines, or state machines of behavior trees, so the job data structure will be settings and states of time-resolution tagged data pointing to actual stats and object values.
FlatBuffers: I'm looking to use this library to build a "job list entry" as well as an "object wrapper" system. Each entry in the job queues will be a flatbuffer object describing the work that needs to be done (settings for the module), as well as containing the data (or shared pointers to the data) for that work. The object storage flatbuffers will contain the data that represents entity tables. For me, most of the actual data will be arrays that need deciding/working on. I'm also looking to use flatbuffers as a communication/control channel between threads. I'm torn between making a master "router" thread that all the others communicate through, and having each thread contain its own channel with some mechanism of discovery.
SQLite: Since only the active regions/systems need higher resolution work done, some of the background job lists the game will create (for thousands of systems and their entities) will be pretty large and long lived: hundreds of thousands to millions of jobs (big in my mind), each requiring an unknown amount of time to complete. In my case, I don't care when they get done, as long as they all do (long campaigns). I plan on each thread getting a table of an in-memory sqlite db as a job queue. Each entry will contain a blob of flatbuffer work, a pointer to a buffer to notify upon completion, a pointer to a control buffer for updates, and other fields decorating the job item (location, data ranges, priority) that will get filled as the job entry makes new jobs, and as the items are consumed into the database. This gives me a way to create relational ties between jobs and simply construct queries if I need to re-work/update jobs, remove them and their dependencies, or update/re-order priorities or dependencies. All this being in an sqlite db also means that at any time I can dump the whole thing to disk and reload it later, or switch to attaching to and processing it from disk. Additionally, this gives me access to a lot of search and ordering algorithmic work I'd normally need a bunch of different types of containers for. Being able to use SQL queries gives me a lot of options to process the jobs.
The communication queue (as a db) is where I'm torn: should access go via the corresponding thread (each thread contains its own messaging db, and the module API has locks/mutexes abstracted for access), or should all updates, adds/removes, and communication go via some master router thread into one large db? I have no idea which will give me the least headaches as far as mutexing and locks. I got a few days into making a monster spaghetti beast of shared pointers to sbuffer pools and lookup tables, so each thread had its own buffer in, and separate out buffers. That's when I decided to just offload the giant list keeping to sqlite. Then I thought, why not just feed the flatbuffer objects of everything else into tables.
Having almost everything in a db means from each module, I can write sql statements that represent the view of the data I need to work on as well as pivot on the fly as to how the data is worked on. Having the jobs themselves in a db means I can do the same for them as well. SQLite has multi-threading access, so using it as a Multithreaded job queue manager shouldn't be too much of a stretch.
In summary, Cpp-Taskflow will allow you to setup complicated nested loops with dependency chaining and job-pool multithreading. Out of the box it comes with most of the structure you need. FlatBuffers will allow you to create job declarations and object wrappers easy to feed into stream-buffers as one unit of work and pass them between job threads, and SQLite will allow you to tag and queue the stream-buffer jobs into blob entries in a way that should allow adding, searching, ordering, updating, and removal with minimal work on your end. It also makes saving and reloading a breeze. Snapshots and roll-backs should also be doable, you just have to keep your mind wrapped around the order and resolution of events for the db.
Edit: Take this with a grain of salt though; I found your question because I'm trying to accomplish what Crashworks described. I'm thinking of using affinity to open long-living threads and have the master thread run the majority of the Cpp-Taskflow hierarchy work, feeding jobs to the others. I've yet to use the sqlite method of job-queue/control communication; that's just my plan so far.
I hope someone finds this helpful.

You might want to look at Flow-Based Programming - it is based on data chunks streaming between asynchronous components. There are Java and C# versions of the driver, plus a number of precoded components. It is intrinsically multithreaded - in fact the only single-threaded code is within the components, although you can add timing constraints to the standard scheduling rules. Although it may be at too fine-grained a level for what you need, there may be stuff here you can use.

Take a look at boost::future (but see also this discussion and proposal) which looks like a really nice foundation for parallelism (in particular it seems to offer excellent support for C-depends-on-A-and-B type situations).
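For the C-depends-on-A-and-B case, a minimal sketch of the futures idea looks like the following. It assumes the standard C++11 std::future and std::async rather than boost::future itself, and simulate_a, simulate_b, simulate_c and run_dependent_jobs are hypothetical stand-ins for the real simulations.

```cpp
#include <cassert>
#include <future>
#include <utility>

// Hypothetical stand-ins for simulations A, B and C.
int simulate_a() { return 20; }
int simulate_b() { return 22; }
int simulate_c(int a, int b) { return a + b; }

int run_dependent_jobs() {
    // A and B do not depend on each other, so launch both at once.
    std::future<int> a = std::async(std::launch::async, simulate_a);
    std::future<int> b = std::async(std::launch::async, simulate_b);

    // C is launched immediately but blocks in get() until both inputs exist.
    std::future<int> c = std::async(
        std::launch::async,
        [](std::future<int> fa, std::future<int> fb) {
            return simulate_c(fa.get(), fb.get());
        },
        std::move(a), std::move(b));

    return c.get();  // 20 + 22
}
```

Note that in this form C occupies a thread while it waits for A and B; a job manager that tracks dependencies explicitly can avoid that by not starting C until its inputs are ready.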
I looked at OpenMP a bit but (like you) wasn't convinced it would work well for anything but Fortran/C numeric code. Intel's Threading Building Blocks looked more interesting to me.
If it comes to it, it's not too hard to roll your own on top of boost::thread.
[Explanation: a thread farm (most people would call it a pool) draws work from a thread-safe queue of functors (tasks or jobs). See the tests and benchmark for examples of use. I have some extra complication to (optionally) support tasks with priorities, and the case where executing tasks can spawn more tasks into the work queue (this makes knowing when all the work is actually completed a bit more problematic; the references to "pending" are the ones which can deal with the case). Might give you some ideas anyway.]
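The thread farm idea can be sketched roughly as follows. This is not the code referred to above: ThreadFarm, submit, wait_all and run_farm_demo are hypothetical names. The sketch uses standard C++11 threads in place of boost::thread, and a "pending" counter that covers both queued and running tasks so that tasks may spawn further tasks.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class ThreadFarm {
public:
    using Task = std::function<void()>;

    explicit ThreadFarm(std::size_t workers) {
        for (std::size_t i = 0; i < workers; ++i)
            threads_.emplace_back([this] { work(); });
    }
    ~ThreadFarm() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& t : threads_) t.join();
    }
    // Higher priority runs first. Safe to call from inside a running task.
    void submit(int priority, Task task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(Entry{priority, std::move(task)});
            ++pending_;
        }
        wake_.notify_one();
    }
    // Blocks until no task is queued or running.
    void wait_all() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return pending_ == 0; });
    }
private:
    struct Entry {
        int priority;
        Task task;
        bool operator<(const Entry& other) const { return priority < other.priority; }
    };
    void work() {
        for (;;) {
            Task task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
                if (queue_.empty()) return;  // stopping and nothing left
                task = queue_.top().task;    // top() is const, so copy
                queue_.pop();
            }
            task();  // may call submit(), which raises pending_ first
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (--pending_ == 0) idle_.notify_all();
            }
        }
    }
    std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable idle_;
    std::priority_queue<Entry> queue_;
    std::size_t pending_ = 0;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};

// Five tasks, each of which spawns one child task; returns the total run.
int run_farm_demo() {
    std::atomic<int> count{0};
    ThreadFarm farm(4);
    for (int i = 0; i < 5; ++i) {
        farm.submit(i, [&farm, &count] {
            ++count;
            farm.submit(10, [&count] { ++count; });
        });
    }
    farm.wait_all();
    return count.load();
}
```

Because a parent task registers its child before the parent's own count is released, pending_ cannot reach zero while spawned work is still outstanding, which is what makes wait_all reliable here.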

You may like to look at Intel Threading Building Blocks. I believe it does what you want, and with version 2 it's open source.

There are plenty of distributed resource managers out there. The software that meets nearly all of your requirements is Sun Grid Engine. SGE is used on some of the world's largest supercomputers and is in active development.
There are also similar solutions in Torque, Platform LSF, and Condor.
It sounds like you may want to roll your own but there's plenty of functionality in all of the above.

I don't know if you're looking for a C++ library (which I think you are), but Doug Lea's Fork/Join framework for Java 7 is pretty nifty, and does exactly what you want. You'd probably be able to implement it in C++ or find a pre-implemented library.
More info here:
http://artisans-serverintellect-com.si-eioswww6.com/default.asp?W1

A little late to the punch perhaps, but take a look also at ThreadWeaver:
http://en.wikipedia.org/wiki/ThreadWeaver
