Deadlocks in concurrent transactions

Deadlocks in concurrent transactions - concurrency

I am trying to find an algorithm that detects deadlocks in concurrent transactions in a software. I have tried googling but have not found anything. Can someone point be to a good resource to follow on the subject or can someone explain this algorithm?

Detecting deadlock implies some knowledge of the resources that are being acquired. In the simpler cases a single resource manager (such as a database) owns the resources (such as locks on records) and so can detect a cycle in lock requests. Hence the algorithms such as those discussed here can be applied.
In the case of two concurrent transactions, taking locks on arbitrary resources I don't see how we can do this without a "supervisor" view of all the locks that are being taken. If we have that supervisory view then we can apply the algorithms referenced earlier.
Primarily we are looking for cycles in the resourece reaquirements. This is an outline of an approach.

Related

Strandify inter coorporating objects for multithread support

My current application owns multiple «activatable» objects*. My intent is to "run" all those object in the same io_context and to add the necessary protection in order to toggle from single to multiple threads (to make it scalable)
If these objects were completely independent from each others, the number of threads running the associated io_context could grow smoothly. But since those objects need to cooperate, the application crashes in multithread despite the strand in each object.
Let's say we have objects of type A and type B, all of them served by the same io_context. Each of those types run asynchronous operations (timers and sockets - their handlers are surrounded with bind_executor(strand, handler)), and can build a cache based on information received via sockets and posted operations to them. Objects of type A needs to get information cached from multiple instances of B in order to perform their own work.
Would it be possible to access this information by using strands (without adding explicit mutex protection) and if yes how ?
If not, what strategy could be adopted to achieve the scalability?
I already tried playing with futures but that strategy leads unsurprisingly to deadlocks.
Thanx
(*) Maybe I'm wrong in the terminology: objects get a reference to an io_context and own their own strand, so I think they are activatable, because they don't own really a running thread

You're mixing vague words a bit. "Activatable", "Strandify", "inter coorporating". They're all close to meaningful concepts, yet, narrowly avoid binding to any precise meaning.
Deconstructing
Let's simplify using more precise concepts.
Let's say we have objects of type A and type B, all of them served by the same io_context
I think it's more fruitful to say "types A and B have associated executors". When you make sure all operations on A and B operate from that executor and you make sure that executor serializes access, then you basically get the Active Object pattern.
[can build a cache based on information received via sockets] and posted operations to them
That's interesting. I take that to mean you don't directly call members of the class, unless they defer the actual execution to the strand. This, again, would be the Active Object.
However, your symptoms suggest that not all operations are "posted to them". Which implies they run on arbitrary threads, leading to your problem.
Would it be possible to access this information by using strands (without adding explicit mutex protection) and if yes how ?
The key to your problems is here. Data dependencies. It's also, ;ole;y going to limit the usefulness of scaling, unless of course the generation of information to retrieve from other threads is a computationally expensive operation.
However, in the light of the phrase _"to get information cached from multiple instances of B'" suggests that in fact, the data is instantaneous, and you'll just be paying synchronization costs for accessing across threads.
Questions
Q. Would it be possible to access this information by using strands (without adding explicit mutex protection) and if yes how ?
Technically, yes. By making sure all operations go on the strand, and the objects become true active objects.
However, there's an important caveat: strands aren't zero-cost. Only in certain contexts they can be optimized (e.g. in immediate continuations or when the execution context has no concurrency).
But in all other contexts, they end up synchronizing at similar cost as mutexes. The purpose of a strand is not to remove the lock contention. Instead it rather allows one to declaratively specify the synchronization requirements for tasks, so that so that the same code can be correctly synchronized regardless of the methods of async completion (using callbacks, futures, coroutines, awaitables, etc) or the chosen execution context(s).
Example: I recently uncovered a vivid illustration of the cost of strand synchronization even in a simple context (where serial execution was already implicitly guaranteed) here:
sehe mar 15, 23:08 Oh cool. The strands were unnecessary. I add them for safety until I know it's safe to go without. In this case the async call chains form logical strands (there are no timers or full duplex sockets going on, so it's all linear). That... improves the situation :)
Now it's 3.5gbps even with the 1024 byte server buffer
The throughput increased ~7x from just removing the strand.
Q. If not, what strategy could be adopted to achieve the scalability?
I suspect you really want caches that contain shared_futures. So that the first retrieval puts the future for the result in cache, where subsequent retrievals get the already existing shared future immediately.
If you make sure your cache lookup datastructure is threadsafe, likely with a reader/writer lock (shared_mutex), you will be free to access it with minimal overhead from any actor, instead of requiring to go through individual strands of each producer.
Keep in mind that awaiting futures is a blocking operation. So, if you do that from tasks posted on the execution context, you may easily run out of threads. In such cases it maybe better to provide async_get in terms of boost::asio::async_result or boost::asio::async_completion so you can wait in non-blocking fashion.

What is the executor pattern in a C++ context?

The author of asio, Christopher Kohlhoff, is working on a library and proposal for executors in C++. His work so far includes this repo and docs. Unfortunately, the rationale portion has yet to be written. So far, the docs give a few examples of what the library does but I don't feel like I'm missing something. Somehow this is more than a family of fancy invoker functions.
Everything I can find on Google is very Java specific and a lot of it is particular to specific frameworks so I'm having trouble figuring out what this "executor pattern" is all about.
What are executors in this context? What do they do? What are the canonical examples of when they would be helpful? What variations exist among executors? What are the alternatives to executors and how do they compare? In particular, there seems to be a lot of overlap with an event loop where the events are initial input events, execution events, and a shutdown event.
When trying to figure out new abstractions I usually find understanding the motivation key. So for executors, what are we trying to abstract and why? What are we trying to make generic? Without executors, what extra work would we have to do?

The most basic benefit of executors is separating the definition of a program's parallelism from how it's used. Java's executor model exists because, by and large, you don't actually know, when you're first writing code, what parallelism model is best for your scenario. You might have little to gain from parallelism and shouldn't use threads at all, you might do best with a long running dedicated worker thread for each core, or a dynamically scaling pool of threads based on current load that cleans up threads after they've been idle a while to reduce memory usage, context switches, etc., or maybe just launching a thread for every task on demand, exiting when the task is done.
The key here is it's nigh impossible to know which approach is best when you're first writing code. You may know where parallelism might help you, but in traditional threading, you end up intermingling the parallelism "configuration" (when and whether to create threads) with the use of parallelism (determining which functions to call with what arguments). When you do mix the code like this, it's a royal pain to do performance testing of different options, because each and every thread launch is independent, and must be updated separately.
The main benefit of the executor model is that the parallelism configuration is done in one place (where the executor is created), and the users of that executor don't have to know anything about it. They just submit work to the executor, receive a future, and at some later point, retrieve the result (blocking if necessary) from the future. If you want to experiment with other configurations, you change the one line defining the executor and run your code again. Even if you decide you need to use different parallelism models for different sections of your code, refactoring to add a second executor and change some of the users of the first executor to use the second is easy compared to manually rewriting the threading details of every site; as long as the executor's name is (relatively) unique, finding users and changing them to use a different one is pretty easy. Executors both simplify your code (by avoiding intermingling thread creation/management with the tasks the threads do) and simplify performance testing.
As a side-benefit, you also abstract away the complexities of transferring data into and out of a worker thread (the submit method encapsulates the former, the future's result method encapsulates the latter). std::async gets you some of this benefit, but with no real control over the parallelism involved (just a yes/no/maybe choice of whether to force a thread, force deferred execution in the current thread, or let the compiler/library decide, with no fine grained control over whether a thread pool is used, and if so, how it behaves). A true executor framework gives you the control std::async fails to provide, with similar ease of use.

Application of Shared Read Locks

what is the need for a read shared lock?
I can understand that write locks have to be exclusive only. But what is the need for many clients to access the document simultaneously and still share only read privilege? Practical applications of Shared read locks would be of great help too.
Please move the question to any other forum you'd find it appropriate to be in.
Though this is a question purely related to ABAP programming and theory I'm doing, I'm guessing the applications are generic to all languages.
Thanks!

If you do complex and time-consuming calculations based on multiple datasets (e. g. postings), you have to ensure that none of these datasets is changed while you're working - otherwise the calculations might be wrong. Most of the time, the ACID principles will ensure this, but sometimes, that's not enough - for example if the datasource is so large that you have to break it up into parallel subtasks or if you have to call some function that performs a database commit or rollback internally. In this case, the transaction isolation is no longer enough, and you need to lock the entity on a logical level.

List of concurrency models

I'd like a large list so I can reference this for ideas. Some answers already have been enlightening .
What are some concurrency models? I heard of message passing where there is no memory shared. Futures which returns an object right away (so it doesn't block) and allows you to dereference the original function returns value later when you need it blocking if the results are not ready yet. I heard of coroutines, software transactional memory and random others.
I searched for a list or a wiki and couldn't find any good ones (many did not list the 3 I mentioned above) and many results gave me a complicated description explaining how it works rather then what it does or how it is to be used.
What are some concurrency models and what is a simple description of what they do? One per answer.

Actor Model
I heard of message passing where there is no memory shared.
Is it about Erlang-style Actors?
Scala uses this idea in its Actors framework (thus, in Scala its not a part of the language, just a library) and it looks quite sexy!
In a few words Actors are objects that have no shared data at all, but can use async messages for interaction. Actors can be located on one or different hosts and use interesting error handling policy (when error happened - actor just dies).
You should read more on this in Erlang and Scala docs, its really straightforward and progressive approach!
Chapters 3, 17, 17.11:
http://www.scala-lang.org/sites/default/files/linuxsoft_archives/docu/files/ScalaByExample.pdf
https://en.wikipedia.org/wiki/Actor_model

COM Threading (Concurrency) Model
Single-Threaded Apartments
Multi-Threaded Apartments
Mixed Model Development
COM objects can be used in multiple threads of a process. The terms
"Single- threaded Apartmen*t" (STA) and
"*Multi-threaded Apartment" (MTA) are
used to create a conceptual framework
for describing the relationship
between objects and threads, the
concurrency relationships among
objects, the means by which method
calls are delivered to an object, and
the rules for passing interface
pointers among threads. Components and
their clients choose between the
following two apartment models
presently supported by COM:
Single-threaded Apartment model (STA):
One or more threads in a process use
COM and calls to COM objects are
synchronized by COM. Interfaces are
marshaled between threads. A
degenerate case of the single-threaded
apartment model, where only one thread
in a given process uses COM, is called
the single-threading model. Previous
Microsoft information and
documentation has sometimes referred
to the STA model simply as the
"apartment model." Multi-threaded
Apartment model (MTA): One or more
threads use COM and calls to COM
objects associated with the MTA are
made directly by all threads
associated with the MTA without any
interposition of system code between
caller and object. Because multiple
simultaneous clients may be calling
objects more or less simultaneously
(simultaneously on multi-processor
systems), objects must synchronize
their internal state by themselves.
Interfaces are not marshaled between
threads. Previous Microsoft
information and documentation has
sometimes referred to this model as
the "free-threaded model." Both the
STA model and the MTA model can be
used in the same process. This is
sometimes referred to as a
"mixed-model" process.
Other models according to Wikipedia
There are several models of concurrent
computing, which can be used to
understand and analyze concurrent
systems. These models include:
Actor model
Object-capability model for security
Petri nets
Process calculi such as
Ambient calculus
Calculus of Communicating Systems (CCS)
Communicating Sequential Processes (CSP)
π-calculus

Futures
A future is a place-holder for the
undetermined result of a (concurrent)
computation. Once the computation
delivers a result, the associated
future is eliminated by globally
replacing it with the result value.
That value may be a future on its own.
Whenever a future is requested by a
concurrent computation, i.e. it tries
to access its value, that computation
automatically synchronizes on the
future by blocking until it becomes
determined or failed.
There are four kinds of futures:
concurrent futures stand for the result of a concurrent computation,
lazy futures stand for the result of a computation that is only performed on request,
promised futures stand for a value that is promised to be delivered later by explicit means,
failed futures represent the result of a computation that terminated with an exception.

Software transactional memory
In computer science, software
transactional memory (STM) is a
concurrency control mechanism
analogous to database transactions for
controlling access to shared memory in
concurrent computing. It is an
alternative to lock-based
synchronization. A transaction in this
context is a piece of code that
executes a series of reads and writes
to shared memory. These reads and
writes logically occur at a single
instant in time; intermediate states
are not visible to other (successful)
transactions. The idea of providing
hardware support for transactions
originated in a 1986 paper and patent
by Tom Knight[1]. The idea was
popularized by Maurice Herlihy and J.
Eliot B. Moss[2]. In 1995 Nir Shavit
and Dan Touitou extended this idea to
software-only transactional memory
(STM)[3]. STM has recently been the
focus of intense research and support
for practical implementations is
growing.

There's also map/reduce.
The idea is to spawn many instances of a sub problem and to combine the answers when they're done. A simple example would be matrix multiplication, which is the sum of several dot products. You spawn a worker thread for each dot product, and when all the threads are finished you sum the result.
This is how GPUs, functional languages such as LISP/Scheme/APL, and some frameworks (Google's Map/Reduce) handle concurrency.

Coroutines
In computer science, coroutines are
program components that generalize
subroutines to allow multiple entry
points for suspending and resuming
execution at certain locations.
Coroutines are well-suited for
implementing more familiar program
components such as cooperative tasks,
iterators, infinite lists and pipes.

There's also non-blocking concurrency such as compare-and-swap and load-link/store-conditional instructions. For example, compare-and-swap (cas) could be defined as so:
bool cas( int new_value, int current_value, int * location );
This operation will then attempt to set the value at location to the value passed in new_value, but only if the value in location is the same as current_value. This only requires one instruction and is usually how blocking concurrency (mutexes/semaphores/etc.) are implemented.

IPC (including MPI and RMI)
Hi,
in the wiki pages you can find that MPI (message passing interface) is a methods of a general IPC technique: http://en.wikipedia.org/wiki/Inter-process_communication
Another interesting approach is a Remote procedure call. For example Java's RMI enables you
to focus only on your application domain and communication patterns. It's an "application level" concurrency.
http://www.oracle.com/technetwork/java/javase/tech/index-jsp-136424.html
There a various design patterns/tools available to aid in shared memory model prallelization. Apart from the mentioned futures one can also take advantage of:
1. Thread pool pattern - focuses on task distribution between fixed number of threads: http://en.wikipedia.org/wiki/Thread_pool_pattern
2. Scheduler pattern - controls the threads execution according to a chosen scheduling policy http://en.wikipedia.org/wiki/Scheduler_pattern
3. Reactor pattern - to embed a single threaded application in a parallel environment http://en.wikipedia.org/wiki/Reactor_pattern
4. OpenMP (allows to parallelize part of code by means of preprocessor pragmas)
Regards,
Marcin

Parallel Random Access Machine (PRAM) is useful for complexity/tractability isues (please refer to a nice book for details).
About models you will also find something here (by Blaise Barney)

How about tuple space?
A tuple space is an implementation of the associative memory paradigm
for parallel/distributed computing. It provides a repository of tuples
that can be accessed concurrently. As an illustrative example,
consider that there are a group of processors that produce pieces of
data and a group of processors that use the data. Producers post their
data as tuples in the space, and the consumers then retrieve data from
the space that match a certain pattern. This is also known as the
blackboard metaphor. Tuple space may be thought as a form of
distributed shared memory.

LMAX's disruptor pattern keeps data in place and assures only one thread (consumer or producer) is owner of a data item (=queue slot) at a time.

Multithreaded job queue manager

I need to manage CPU-heavy multitaskable jobs in an interactive application. Just as background, my specific application is an engineering design interface. As a user tweaks different parameters and options to a model, multiple simulations are run in the background and results displayed as they complete, likely even as the user is still editing values. Since the multiple simulations take variable time (some are milliseconds, some take 5 seconds, some take 10 minutes), it's basically a matter of getting feedback displayed as fast as possible, but often aborting jobs that started previously but are now no longer needed because of the user's changes have already invalidated them. Different user changes may invalidate different computations so at any time I may have 10 different simulations running. Somesimulations have multiple parts which have dependencies (simulations A and B can be seperately computed, but I need their results to seed simulation C so I need to wait for both A and B to finish first before starting C.)
I feel pretty confident that the code-level method to handle this kind of application is some kind of multithreaded job queue. This would include features of submitting jobs for execution, setting task priorities, waiting for jobs to finish, specifying dependencies (do this job, but only after job X and job Y have finished), canceling subsets of jobs that fit some criteria, querying what jobs remain, setting worker thread counts and priorities, and so on. And multiplatform support is very useful too.
These are not new ideas or desires in software, but I'm at the early design phase of my application where I need to make a choice about what library to use for managing such tasks. I've written my own crude thread managers in the past in C (I think it's a rite of passage) but I want to use modern tools to base my work on, not my own previous hacks.
The first thought is to run to OpenMP but I'm not sure it's what I want. OpenMP is great for parallelizing at a fine level, automatically unrolling loops and such. While multiplatform, it also invades your code with #pragmas. But mostly it's not designed for managing large tasks.. especially cancelling pending jobs or specifying dependencies. Possible, yes, but it's not elegant.
I noticed that Google Chrome uses such a job manager for even the most trivial tasks. The design goal seems to be to keep the user interaction thread as light and nimble as possible, so anything that can get spawned off asynchronously, should be. From looking at the Chrome source this doesn't seem to be a generic library, but it still is interesting to see how the design uses asynchronous launches to keep interaction fast. This is getting to be similar to what I'm doing.
There are a still other options:
Surge.Act: a Boost-like library for defining jobs. It builds on OpenMP, but does allow chaining of dependencies which is nice. It doesn't seem to feel like it's got a manager that can be queried, jobs cancelled, etc. It's a stale project so it's scary to depend on it.
Job Queue is quite close to what I'm thinking of, but it's a 5 year old article, not a supported library.
Boost.threads does have nice platform independent synchronization but that's not a job manager. POCO has very clean designs for task launching, but again not a full manager for chaining tasks. (Maybe I'm underestimating POCO though).
So while there are options available, I'm not satisfied and I feel the urge to roll my own library again. But I'd rather use something that's already in existence. Even after searching (here on SO and on the net) I haven't found anything that feels right, though I imagine this must be a kind of tool that is often needed, so surely there's some community library or at least common design.
On SO there's been some posts about job queues, but nothing that seems to fit.
My post here is to ask you all what existing tools I've missed, and/or how you've rolled your own such multithreaded job queue.

We had to build our own job queue system to meet requirements similar to yours ( UI thread must always respond within 33ms, jobs can run from 15-15000ms ), because there really was nothing out there that quite met our needs, let alone was performant.
Unfortunately our code is about as proprietary as proprietary gets, but I can give you some of the most salient features:
We start up one thread per core at the beginning of the program. Each pulls work from a global job queue. Jobs consist of a function object and a glob of associated data (really an elaboration on a func_ptr and void *). Thread 0, the fast client loop, isn't allowed to work on jobs, but the rest grab as they can.
The job queue itself ought to be a lockless data structure, such as a lock-free singly linked list (Visual Studio comes with one). Avoid using a mutex; contention for the queue is surprisingly high, and grabbing mutexes is costly.
Pack up all the necessary data for the job into the job object itself -- avoid having pointer from the job back into the main heap, where you'll have to deal with contention between jobs and locks and all that other slow, annoying stuff. For example, all the simulation parameters should go into the job's local data blob. The results structure obviously needs to be something that outlives the job: you can deal with this either by a) hanging onto the job objects even after they've finished running (so you can use their contents from the main thread), or b) allocating a results structure specially for each job and stuffing a pointer into the job's data object. Even though the results themselves won't live in the job, this effectively gives the job exclusive access to its output memory so you needn't muss with locks.
Actually I'm simplifying a bit above, since we need to choreograph exactly which jobs run on which cores, so each core gets its own job queue, but that's probably unnecessary for you.

I rolled my own, based on Boost.threads. I was quite surprised by how much bang I got from writing so little code. If you don't find something pre-made, don't be afraid to roll your own. Between Boost.threads and your experience since writing your own, it might be easier than you remember.
For premade options, don't forget that Chromium is licensed very friendly, so you may be able to roll your own generic library around its code.

Microsoft is working on a set of technologies for the next Version of Visual Studio 2010 called the Concurrency Runtime, the Parallel Pattern Library and the Asynchronous Agents Library which will probably help. The Concurrency Runtime will offer policy based scheduling, i.e. allowing you to manage and compose multiple scheduler instances (similar to thread pools but with affinitization and load balancing between instances), the Parallel Pattern Library will offer task based programming and parallel loops with an STL like programming model. The Agents library offers an actor based programming model and has support for building concurrent data flow pipelines, i.e. managing those dependencies described above. Unfortunately this isn't released yet, so you can read about it on our team blog or watch some of the videos on channel9 there is also a very large CTP that is available for download as well.
If you're looking for a solution today, Intel's Thread Building Blocks and boost's threading library are both good libraries and available now. JustSoftwareSolutions has released an implementation of std::thread which matches the C++0x draft and of course OpenMP is widely available if you're looking at fine-grained loop based parallelism.
The real challenge as other folks have alluded to is to correctly identify and decompose work into tasks suitable for concurrent execution (i.e. no unprotected shared state), understand the dependencies between them and minimize the contention that can occur on bottlenecks (whether the bottleneck is protecting shared state or ensuring the dispatch loop of a work queue is low contention or lock-free)... and to do this without scheduling implementation details leaking into the rest of your code.
-Rick

Would something like threadpool be useful to you? It's based on boost::threads and basically implements a simple thread task queue that passes worker functions off to the pooled threads.

I've been looking for near the same requirements. I'm working on a game with 4x-ish mechanics and scheduling different parts of what gets done almost exploded my brain. I have a complex set of work that needs to get accomplished at different time resolutions, and to a different degree of actual simulation depending on what system/region the player has actively loaded. This means as the player moves from system to system, I need to load a system to the current high resolution simulation, offload the last system to a lower resolution simulation, and do the same for active/inactive regions of systems. The different simulations are big lists of population, political, military, and economic actions based on profiles of each entity. I'm going to try to describe my issue and my approach so far and I hope it's useful at describe an alternative for you or someone else. The rough outline of the structure I'm building will use the following:
cpp-taskflow (A Modern C++ Parallel Task Programming Library) I'm going to make a library of modules that will be used as job construction parts. Each entry will have an API for initializing and destruction as well as pointers for communication. I'm hoping to write it in a way that they will be nest-able using the cpp-taskflow API to set-up all the dependencies at job creation time, but provide a means of live adjustment and having a kill-switch available. Most of what I'm making will be decision trees of state machines, or state machines of behavior trees so the job data structure will be settings and states of time-resolution tagged data pointing to actual stats and object values.
FlatBuffers I'm looking to use this library to build a "job list entry" as well as an "object wrapper" system. Each entry in the job queues will be a flatbuffer object describing the work needed done(settings for the module), as well as containing the data(or shared pointers to the data) for the work that needs done. The object storage flatbuffers will contain the data that represents entity tables. For me, most of the actual data will me arrays that need deciding/working on. I'm also looking to use flatbuffers as a communication/control channel between threads. I'm torn on making a master "router" thread all the others communicate through, or each one containing their own, and having some mechanism of discovery.
SQLite Since only the active regions/systems need higher resolution work done, some of the background job lists the game will create(for thousands of systems and their entities) will be pretty large and long lived. 100's of thousands - millions of jobs(big in my mind), each requiring an unknown amount of time to complete. In my case, I don't care when they get done, as long as they all do(long campains). I plan on each thread getting a table of an in-memory sqlite db as a job queue. Each entry will contain a blob of flatbuffer work, a pointer to a buffer to notify upon completion, a pointer to a control buffer for updates, and other fields decorating the job item(location, data ranges, priority) that will get filled as the job entry makes new jobs, and as the items are consumed into the database. This give me a way I can create relational ties between jobs and simply construct queries if I need to re-work/update jobs, remove them and their dependencies, or update/re-order priorities or dependencies. All this being used in an sqlite db also means that at any time I can dump the whole thing to disk and reload it later, or switch to attaching to and processing it from disk. Additionally, this gives me access to a lot of search and ordering algorithmic work I'd normally need a bunch of different types of containers for. Being able to use SQL queries gives me a lot of options to process the jobs.
The communication queue(as a db) is what I'm torn as to whether I should make access via the corresponding thread(each thread contains it's own messaging db, and the module API has locks/mutex abstracted for access), or have all updates, adds/removes, and communication via some master router thread into one large db. I have no idea which will give me the least headaches as far as mutexing and locks. I got a few days into making a monster spaghetti beast of shared pointers to sbuffer pools and lookup tables, so each thread had it's own buffer in, and separate out buffers. That's when I decided to just offload the giant list keeping to sqlite. Then I thought, why not just feed the flatbuffer objects of everything else into tables.
Having almost everything in a db means from each module, I can write sql statements that represent the view of the data I need to work on as well as pivot on the fly as to how the data is worked on. Having the jobs themselves in a db means I can do the same for them as well. SQLite has multi-threading access, so using it as a Multithreaded job queue manager shouldn't be too much of a stretch.
In summary, Cpp-Taskflow will allow you to setup complicated nested loops with dependency chaining and job-pool multithreading. Out of the box it comes with most of the structure you need. FlatBuffers will allow you to create job declarations and object wrappers easy to feed into stream-buffers as one unit of work and pass them between job threads, and SQLite will allow you to tag and queue the stream-buffer jobs into blob entries in a way that should allow adding, searching, ordering, updating, and removal with minimal work on your end. It also makes saving and reloading a breeze. Snapshots and roll-backs should also be doable, you just have to keep your mind wrapped around the order and resolution of events for the db.
Edit: Take this with a grain of salt though, I found your question because I'm trying to accomplish what Crashworks described. I'm thinking of using affinity to open long living threads and have the master thread run the majority of the Cpp-Taskflow hierarchy work, feeding jobs to the others. I've yet to use the sqlite meothod of job-queue/control communication, that's just my plan so far.
I hope someone finds this helpful.

You might want to look at Flow-Based Programming - it is based on data chunks streaming between asynchronous components. There are Java and C# versions of the driver, plus a number of precoded components. It is intrinsically multithreaded - in fact the only single-threaded code is within the components, although you can add timing constraints to the standard scheduling rules. Although it may be at too fine-grained a level for what you need, there may be stuff here you can use.

Take a look at boost::future (but see also this discussion and proposal) which looks like a really nice foundation for parallelism (in particular it seems to offer excellent support for C-depends-on-A-and-B type situations).
I looked at OpenMP a bit but (like you) wasn't convinced it would work well for anything but Fortran/C numeric code. Intel's Threading Building Blocks looked more interesting to me.
If it comes to it, it's not too hard to roll your own on top of boost::thread.
[Explanation: a thread farm (most people would call it a pool) draws work from a thread-safe queue of functors (tasks or jobs). See the tests and benchmark for examples of use. I have some extra complication to (optionally) support tasks with priorities, and the case where executing tasks can spawn more tasks into the work queue (this makes knowing when all the work is actually completed a bit more problematic; the references to "pending" are the ones which can deal with the case). Might give you some ideas anyway.]

You may like to look at Intel Thread Building Blocks. I beleave it does what you want and with version 2 it's Open Source.

There's plenty of distributed resource managers out there. The software that meets nearly all of your requirements is Sun Grid Engine. SGE is used on some of the worlds largest supercomputers and is in active development.
There's also similar solutions in Torque, Platform LSF, and Condor.
It sounds like you may want to roll your own but there's plenty of functionality in all of the above.

I don't know if you're looking for a C++ library (which I think you are), but Doug Lea's Fork/Join framework for Java 7 is pretty nifty, and does exactly what you want. You'd probably be able to implement it in C++ or find a pre-implemented library.
More info here:
http://artisans-serverintellect-com.si-eioswww6.com/default.asp?W1

A little late to the punch perhaps, but take a look also at ThreadWeaver:
http://en.wikipedia.org/wiki/ThreadWeaver

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js