concurrent garbage collection for a c++ graph data structure

I have a directed graph data structure used for audio signal processing (see http://audulus.com if you're curious).
I would like graph edges to be strong references, so in the absence of cycles, std::shared_ptr would do the trick. Alas, there are potentially cycles in the graph.
So, I had this idea for a simple concurrent mark-sweep collector:
The mutator thread sends events to the collector thread. The collector thread maintains its own representation of the graph and does not traverse the mutator thread's graph. The collector thread just uses mark-sweep at regular intervals.
The events would be the following (in function call form):
AddRoot(Node*)
RemoveRoot(Node*)
AddEdge(Node*, Node*)
RemoveEdge(Node*, Node*)
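To make this concrete, here is a rough sketch of what I have in mind: the mutator pushes plain-old-data events onto a lock-free queue, and the collector drains them into its own mirror of the graph before marking and sweeping. This is illustrative only, not my actual implementation; Event, the adjacency representation, and the collect loop are all placeholders.

    // Hypothetical sketch; not the actual audulus/collector code.
    #include <boost/lockfree/queue.hpp>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    struct Node;  // mutator-side node; the collector only handles pointers

    enum class EventType { AddRoot, RemoveRoot, AddEdge, RemoveEdge };

    struct Event {
        EventType type;
        Node* a;
        Node* b;  // unused for the root events
    };

    // The mutator pushes events; pushing is lock-free, which suits a
    // real-time thread as long as the queue does not allocate.
    boost::lockfree::queue<Event> events{1024};

    struct Collector {
        std::unordered_map<Node*, int> roots;                 // root counts
        std::unordered_map<Node*, std::vector<Node*>> edges;  // adjacency

        void collect() {
            // Drain the queue into the private mirror of the graph.
            Event e;
            while (events.pop(e)) { /* apply e to roots/edges */ }

            // Mark: everything reachable from a live root survives.
            std::unordered_set<Node*> live;
            std::vector<Node*> stack;
            for (auto& [node, count] : roots)
                if (count > 0) stack.push_back(node);
            while (!stack.empty()) {
                Node* n = stack.back(); stack.pop_back();
                if (!live.insert(n).second) continue;  // already marked
                for (Node* m : edges[n]) stack.push_back(m);
            }

            // Sweep: delete every node the mirror has seen that was not
            // marked (bookkeeping of the full node set omitted for brevity).
        }
    };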
Is this scheme correct? The collector thread has an older version of what the mutator thread sees. My intuition is that since a node that is unreachable at an earlier time will still be unreachable at a later time, the collector thread may delete an unreachable object as soon as it finds one.
Also, if it's correct for one mutator thread, would it work for multiple mutator threads?
UPDATE
I've released the code here: https://github.com/audulus/collector. The code is actually fairly general purpose. Use RootPtr<T> to automatically keep track of root nodes. Links between nodes are managed using EdgePtr<T>.
The collector seems to work for multiple mutator threads (both in my app and in unit tests), but I feel like a proof of correctness is needed.
PLEASE NOTE (in response to @AaronGolden's comment below; judging from the comments, people aren't reading this): The mutator thread is responsible for calling the collector functions in the correct order. For example, if the mutator thread calls RemoveEdge(a,b) before assigning b to a RootPtr, the collector may intervene and collect b.
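In code, the required ordering looks roughly like this (illustrative; it assumes the audulus/collector headers, and the exact API may differ in details):

    // Hypothetical node type; the real API may differ.
    struct Node {
        EdgePtr<Node> next;  // strong edge tracked by the collector
    };

    void detach(RootPtr<Node>& a) {
        // Correct order: pin b with a root *before* removing the edge to it.
        RootPtr<Node> b = a->next;  // AddRoot(b) happens first
        a->next = nullptr;          // then RemoveEdge(a, b); b stays alive
        // Reversing the two lines would let the collector reclaim b in
        // between the RemoveEdge and the AddRoot.
    }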
UPDATE 2:
I've updated the code to my latest version and updated the link above. I've now used the code in my app for over a year and haven't attributed any bugs to it.

One argument I think is somewhat persuasive (though I would hesitate to call it proof) that this scheme works is that in the absence of cycles, the scheme is equivalent to reference counting with atomic reference counts.
In the absence of cycles, AddRoot and AddEdge map to incrementing a reference count, and RemoveRoot and RemoveEdge map to decrementing one. Pushing an event onto the queue (I use boost::lockfree::queue) is an atomic operation, just like updating reference counts.
So then the remaining question is: how do cycles change the picture in terms of correctness? To wave hands a bit, cycles are a property of the connectivity of the graph, but don't have an effect on the atomicity of the operations or the ability of one thread to know something earlier than it would otherwise (causing a potential bad ordering of operations).
This would suggest that if there's a counterexample for the scheme, it will involve playing some game with cycles.

Is this scheme correct?
I'm concerned that you don't have any concept of safe points. In particular, can an update require more than one of your actions to be executed atomically? Perhaps it is ok because you can always add all vertices and edges in a batch before removing.
Also, if it's correct for one mutator thread, would it work for multiple mutator threads?
If one thread drops a root to a subgraph just after another picks up a root to the same subgraph, then you must make sure you get the messages in order, which means you cannot use per-mutator queues. And a global queue is likely to kill scalability.
One of my constraints is that the GC has to be lock-free because of my real-time DSP thread
Is the allocator lock-free?
What if the GC cannot keep up with mutator(s)?
Also, I would recommend considering:
Forking and collecting in a child process.
Incremental mark-sweep with Dijkstra's tricolor marking scheme.
Baker's treadmill.
VCGC.
Separate per-thread heaps with deep copying of messages.

Related

Strandify inter coorporating objects for multithread support

My current application owns multiple «activatable» objects*. My intent is to "run" all those objects in the same io_context and to add the necessary protection in order to toggle from single to multiple threads (to make it scalable).
If these objects were completely independent from each other, the number of threads running the associated io_context could grow smoothly. But since those objects need to cooperate, the application crashes with multiple threads despite the strand in each object.
Let's say we have objects of type A and type B, all of them served by the same io_context. Each of those types runs asynchronous operations (timers and sockets; their handlers are wrapped with bind_executor(strand, handler)) and can build a cache based on information received via sockets and on operations posted to them. Objects of type A need to get information cached from multiple instances of B in order to perform their own work.
Would it be possible to access this information by using strands (without adding explicit mutex protection) and if yes how ?
If not, what strategy could be adopted to achieve the scalability?
I already tried playing with futures but that strategy leads unsurprisingly to deadlocks.
Thanks
(*) Maybe I'm wrong in the terminology: objects get a reference to an io_context and own their own strand, so I think they are activatable, because they don't really own a running thread.
You're mixing vague words a bit. "Activatable", "Strandify", "inter coorporating". They're all close to meaningful concepts, yet, narrowly avoid binding to any precise meaning.
Deconstructing
Let's simplify using more precise concepts.
Let's say we have objects of type A and type B, all of them served by the same io_context
I think it's more fruitful to say "types A and B have associated executors". When you make sure all operations on A and B operate from that executor and you make sure that executor serializes access, then you basically get the Active Object pattern.
[can build a cache based on information received via sockets] and on operations posted to them
That's interesting. I take that to mean you don't directly call members of the class, unless they defer the actual execution to the strand. This, again, would be the Active Object.
However, your symptoms suggest that not all operations are "posted to them". Which implies they run on arbitrary threads, leading to your problem.
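For contrast, here is a minimal sketch of a true active object, where every operation that touches the object's state is posted to its strand (the class and member names are made up for illustration):

    #include <boost/asio.hpp>
    #include <map>
    #include <string>
    #include <utility>

    // Illustrative active object: every operation that touches cache_ is
    // posted to the strand, so no mutex is needed even when many threads
    // run the io_context.
    class B {
        boost::asio::strand<boost::asio::io_context::executor_type> strand_;
        std::map<std::string, std::string> cache_;  // only touched on strand_

    public:
        explicit B(boost::asio::io_context& io)
            : strand_(boost::asio::make_strand(io)) {}

        void update(std::string key, std::string value) {
            boost::asio::post(strand_,
                [this, k = std::move(key), v = std::move(value)] {
                    cache_[k] = v;
                });
        }

        // Asynchronous read: the result is delivered via a handler instead
        // of a return value, so the lookup itself also runs on the strand.
        template <typename Handler>
        void async_get(std::string key, Handler handler) {
            boost::asio::post(strand_,
                [this, k = std::move(key), h = std::move(handler)]() mutable {
                    auto it = cache_.find(k);
                    h(it != cache_.end() ? it->second : std::string{});
                });
        }
    };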
Would it be possible to access this information by using strands (without adding explicit mutex protection) and if yes how ?
The key to your problems is here. Data dependencies. It's also likely going to limit the usefulness of scaling, unless of course the generation of the information to retrieve from other threads is a computationally expensive operation.
However, the phrase "to get information cached from multiple instances of B" suggests that, in fact, the data is instantaneous, and you'll just be paying synchronization costs for accessing it across threads.
Questions
Q. Would it be possible to access this information by using strands (without adding explicit mutex protection) and if yes how ?
Technically, yes. By making sure all operations go on the strand, and the objects become true active objects.
However, there's an important caveat: strands aren't zero-cost. Only in certain contexts can they be optimized away (e.g. in immediate continuations, or when the execution context has no concurrency).
But in all other contexts, they end up synchronizing at a cost similar to mutexes. The purpose of a strand is not to remove lock contention. Rather, it allows one to declaratively specify the synchronization requirements for tasks, so that the same code can be correctly synchronized regardless of the method of async completion (callbacks, futures, coroutines, awaitables, etc.) or the chosen execution context(s).
Example: I recently uncovered a vivid illustration of the cost of strand synchronization even in a simple context (where serial execution was already implicitly guaranteed) here:
sehe mar 15, 23:08 Oh cool. The strands were unnecessary. I add them for safety until I know it's safe to go without. In this case the async call chains form logical strands (there are no timers or full duplex sockets going on, so it's all linear). That... improves the situation :)
Now it's 3.5gbps even with the 1024 byte server buffer
The throughput increased ~7x from just removing the strand.
Q. If not, what strategy could be adopted to achieve the scalability?
I suspect you really want caches that contain shared_futures. So that the first retrieval puts the future for the result in cache, where subsequent retrievals get the already existing shared future immediately.
If you make sure your cache lookup data structure is thread-safe, likely with a reader/writer lock (shared_mutex), you will be free to access it with minimal overhead from any actor, instead of having to go through the individual strands of each producer.
Keep in mind that awaiting futures is a blocking operation. So, if you do that from tasks posted on the execution context, you may easily run out of threads. In such cases it may be better to provide async_get in terms of boost::asio::async_result or boost::asio::async_completion so you can await in a non-blocking fashion.
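Putting those two suggestions together, here is a sketch of a shared_future cache guarded by a shared_mutex. The key and value types are placeholders for whatever B actually caches, and the caveat above still applies: get() on the future blocks, so only wait from threads that can afford it.

    #include <future>
    #include <map>
    #include <shared_mutex>
    #include <string>

    // The first request for a key starts the computation; later requests
    // immediately get the same shared_future, ready or not.
    class FutureCache {
        std::shared_mutex mutex_;
        std::map<std::string, std::shared_future<std::string>> cache_;

    public:
        template <typename Compute>  // Compute: () -> std::string
        std::shared_future<std::string> get(std::string const& key,
                                            Compute compute) {
            {   // fast path: reader lock only
                std::shared_lock lock(mutex_);
                if (auto it = cache_.find(key); it != cache_.end())
                    return it->second;
            }
            std::unique_lock lock(mutex_);  // slow path: writer lock
            auto [it, inserted] = cache_.try_emplace(key);
            if (inserted)  // we won the race: start the computation once
                it->second = std::async(std::launch::async, compute).share();
            return it->second;
        }
    };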

How do functional languages handle shared state data?

I've been learning about functional programming and see that it can certainly make parallelism easier to handle, but I do not see how it makes handling shared resources easier. I've seen people talk about variable immutability being a key factor, but how does that help two threads accessing the same resource? Say two threads are adding a request to a queue. They both get a copy of the queue, make a new copy with their request added (since the queue is immutable), and then return the new queue. The first request to return will be overwritten by the second, as the copies of the queue each thread got did not have the other thread's request present. So I assume there is a locking mechanism a la mutex available in functional languages? How then does that differ from an imperative approach to the problem? Or do practical applications of functional programming still require some imperative operations to handle shared resources?
As soon as your global data can be updated, you're breaking the pure functional paradigm. In that sense, you need some sort of imperative structure. However, this is important enough that most functional languages offer a way to do it, and you need it to be able to communicate with the rest of the world anyway. (The most thoroughly formalized one is the IO monad of Haskell.) Apart from simple bindings for some other synchronization library, they would probably try to implement a lock-free, wait-free data structure if possible.
Some approaches include:
Data that is written only once and never altered can be accessed safely with no locks or waiting on most CPUs. (There is typically a memory fence instruction to ensure that the memory updates in the right order for both the producer and the consumer.)
Some data structures, such as a persistent association list, have the property that you can tack on updates without invalidating any existing data. Let's say you have the association list [(1,'a'), (2,'b'), (3,'c')] and you want to update it by changing the third entry to 'g'. If you express this as (3,'g'):originalList, then you can replace the current list with the new version while keeping originalList valid and unaltered. Any thread that saw it can still safely use it.
Even if you have to work around the garbage collector, each thread can make its own thread-local copy of the shared state so long as the original does not get deleted while it is being copied. The underlying low-level implementation would be a producer/consumer model that atomically updates a pointer to the state data and inserts memory-fence instructions before the update and the copy operations.
If the program has a way to compare-and-swap atomically, and the garbage collector is aware of it, each thread can use the read-copy-update pattern (see the sketch after this list). A thread-aware garbage collector will keep the older data around as long as any thread is using it, and recycle it when the last thread is done with it. This should not require locking in software (for example, on modern ISAs, incrementing or decrementing a word-sized counter is an atomic operation, and atomic compare-and-swap is wait-free).
The functional language can add an extension to call an IPC library written in some other language, and update data in place. In Haskell, this would be defined with the IO monad to ensure sequential memory consistency, but nearly every functional language has some way to exchange data with the system libraries.
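As a concrete illustration of several of the points above in one place, here is a C++ sketch of the single-writer publish pattern, with shared_ptr reference counting standing in for the garbage collector. It requires C++20 for std::atomic<std::shared_ptr>, and note that type is not guaranteed to be lock-free on every platform.

    #include <atomic>
    #include <memory>
    #include <vector>

    // Readers grab an immutable snapshot; the single writer copies, edits
    // the copy, and atomically publishes it. The shared_ptr control block
    // keeps old versions alive while some reader still holds them.
    using Snapshot = std::vector<int>;

    std::atomic<std::shared_ptr<const Snapshot>> current{
        std::make_shared<const Snapshot>()};

    std::shared_ptr<const Snapshot> read() {
        return current.load(std::memory_order_acquire);  // never blocks
    }

    void update(int value) {  // single writer
        auto old = current.load(std::memory_order_acquire);
        auto next = std::make_shared<Snapshot>(*old);  // private copy
        next->push_back(value);                        // edit the copy
        current.store(std::shared_ptr<const Snapshot>(std::move(next)),
                      std::memory_order_release);      // publish atomically
    }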
So, a functional language does offer some guarantees that are useful for efficient concurrent programming. For example, most current ISAs impose no extra overhead on multiple reader threads when there is at most a single writer, certain consistency bugs cannot occur, and functional languages are well-suited to express this pattern.

Code examples that use fine grained locks (JCR Jackrabbit?)

I'm doing an academic research in trying to develop a programming tool that assists in implementing fine-grained locking functions, for concurrent programs that maintain tree-like data structures.
For example, the programmer may write some functions that receive a tree root-node and modify the tree (by traversing on some routes and adding/removing nodes), and the tool will help him to find where in the code nodes should be locked and where they can be released - so the functions could be executed concurrently on the same tree.
I am looking for some real-life code examples in which such fine-grained locking is used, or where it could be used for better performance but the programmer was too lazy to implement it (for example, he locked the whole tree during the function-call without releasing useless nodes).
I read about JCR & Jackrabbit, which use a tree-shaped database, and found an article that explains how to lock nodes in JCR (but without examples):
http://www.day.com/specs/jcr/2.0/17_Locking.html
I have a really small background in databases, and I don't fully understand what is allowed and what is not allowed when it comes to Jackrabbit databases and concurrency. Accessing the same node from 2 threads is not allowed, but what about different repositories? And what happens if 2 different clients try to access the same node (for example, one tries to delete it, and another one tries to modify it - will the session.save() fail?).
Thanks,
Oren
First of all, don't get confused between databases/jackrabbit/locking. Jackrabbit implements its own locking, as do databases.
Jackrabbit allows you to lock nodes using LockManager.lock(). Setting the isDeep parameter to true means all the nodes underneath will also be locked. A locked node can be read by another session but can't be modified.
Technically speaking, 2 threads COULD edit the same node if they are using the same session but this is rather hazardous and should probably be avoided.
If a node is likely to be modified by 2 concurrent sessions then you should always lock the node. Whichever session gets there last should wait for the lock to be released. If you don't lock, then at least one of the sessions will throw an exception.
I'm not sure what you mean by accessing nodes from different repositories. A node can only belong to one repository. If you mean having 2 jackrabbit instances accessing the same database then this too should be avoided or you should look to using clustering.
When implementing locking it is going to depend on your design and requirements. There's no point locking if you'll only ever have one session and vice-versa. Whether you lock a node or sub-tree is going to depend on what your data represents. For example, if a node represents a folder you'll probably want to lock just the node and not the whole sub-tree. If a sub-tree represents a complex document then you probably will want to lock the sub-tree.
As for locking the whole tree, I hope I don't meet someone who does that!

multithreaded linked list in C++

First a little explanation of what I am trying to do:
My plan is to write a program with a socket stream implemented using the boost::asio library which feeds data to a parser implemented using boost::spirit::qi. The parser will take the packets and fill a packet object and then append that object to the end of a linked list of packet objects. The packet processor will read the first object in the list and do its processing, then move on to the next item and delete the first.
I decided to use a linked list because if I used a std::queue I would have to lock the entire container every time the stream added a packet or the processor removed one, which would make the two threads run more or less serially, which I would like to avoid. Plus the queue class has a tendency to copy entire objects, whereas the linked-list idea has the benefit of creating the object once and then just pointing to it.

To avoid serializing this whole business I intend to place boost::mutex mutexes in each node and lock them from there. The idea is to have the socket stream create the list and immediately lock the first node, populate the node from the parser, create the next node and lock it, then unlock the first node and move on to the next node to do work. This way there's never an unlocked node dangling at the end that the packet processor may jump to and delete under the socket stream's nose. The packet processor will check the first node and try to lock it; if it locks it, it will do its processing, then unlock it, get the next node, and then delete the first node. This way serialization is limited to those times when the packet processor has caught up to the socket stream class.
So now my question is, before I do the work of actually implementing this: does this sound like a good idea? I've tried it in a trivial test and it seems to work all right, and I can't think of any serious issues with it as long as I implement exception handling and take care to free any memory I allocate. But if anyone can think of any problems with this idea that I've overlooked, I would appreciate the input. I would also appreciate any other suggestions, either as an alternative or to make this idea work better.
Thanks in advance!
Check this article; it's about multiple consumers, but still brilliant:
Measuring Parallel Performance: Optimizing a Concurrent Queue
This implementation is screaming three things at me:
Way too easy to get deadlock, because insert and delete will need to lock multiple mutexes at once, and you can't do that atomically. Well, you can, but you would need to put mutexes around your mutexes.
Way too easy to get corruption. The problems that lead to deadlock can also lead to corruption.
And slow, slow, slow. Think of your poor list walker. Each step involves an unlock, a lock, another unlock, and another lock. It is going to have to be very careful in stepping, and man, will it be expensive. Lock and unlock each item, and do so in the correct order? Ouch.
This looks like a case of premature optimization and premature pessimization operating in parallel. What drove this architecture?
I suggest starting with the simple solution first. Lock the whole thing each time whenever you want to touch any part of it. See if that gets you in trouble. If it doesn't, problem solved. If it does, a next step up is to switch from mutexes to read-write locks. Just one, not a gazillion. Now you have to think a bit about whether you want a shared or exclusive lock. If you have been diligent about const-correctness, try using a shared lock (read lock) for your const methods, an exclusive lock (write lock) for your non-const methods.
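In code, that progression might look like the following sketch: a single lock around the whole container, which you can later upgrade to a reader/writer lock for the const paths (using the standard C++ types here rather than the Boost ones):

    #include <mutex>
    #include <queue>
    #include <shared_mutex>

    // The "simple solution first": one lock around the whole container.
    // shared_mutex lets const readers share access later on.
    template <typename T>
    class LockedQueue {
        mutable std::shared_mutex mutex_;
        std::queue<T> queue_;

    public:
        void push(T value) {
            std::unique_lock lock(mutex_);  // exclusive: we modify
            queue_.push(std::move(value));
        }

        bool try_pop(T& out) {
            std::unique_lock lock(mutex_);  // exclusive: we modify
            if (queue_.empty()) return false;
            out = std::move(queue_.front());
            queue_.pop();
            return true;
        }

        bool empty() const {
            std::shared_lock lock(mutex_);  // shared: read-only access
            return queue_.empty();
        }
    };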
I don't think what you suggest will work. Remember that when you delete a node from a linked list, you need to update the other nodes that point to the deleted node. Similarly, when you add a node you also need to update other nodes to point at the new node. So just locking the node being deleted or added isn't enough.
There are lock-free queues, but they are fiendishly difficult to get correct. For instance, look at the initial comments to the article here describing the extra work required to get a published algorithm to work.
Even though you are calling this a linked list, this is, in effect, a queue.
It is possible to implement Single Producer Single Consumer lock-free queues if you are willing to use a fixed-size buffer. This lets you control the memory usage, at the cost of making the Producer wait if the Consumer is not fast enough.
Apart from this slight point, your design looks fine; it will probably be easier to get it right than the lock-free alternative.
Do remember to have a termination condition (a null pointer in the next field, for example) so that the Producer can signal to the Consumer that there is nothing more to process.
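A minimal sketch of such a fixed-size SPSC queue follows (illustrative; production implementations typically also pad the two indices apart to avoid false sharing). A null pointer or sentinel value pushed through it can serve as the termination condition just mentioned.

    #include <atomic>
    #include <cstddef>
    #include <optional>

    // Exactly one thread may call push() and exactly one may call pop();
    // the acquire/release pairs order the handoff of each slot.
    template <typename T, size_t N>
    class SpscQueue {
        T buffer_[N];
        std::atomic<size_t> head_{0};  // advanced by the consumer
        std::atomic<size_t> tail_{0};  // advanced by the producer

    public:
        bool push(T value) {  // producer only
            size_t tail = tail_.load(std::memory_order_relaxed);
            size_t next = (tail + 1) % N;
            if (next == head_.load(std::memory_order_acquire))
                return false;  // full: producer must wait and retry
            buffer_[tail] = std::move(value);
            tail_.store(next, std::memory_order_release);
            return true;
        }

        std::optional<T> pop() {  // consumer only
            size_t head = head_.load(std::memory_order_relaxed);
            if (head == tail_.load(std::memory_order_acquire))
                return std::nullopt;  // empty: nothing to process yet
            T value = std::move(buffer_[head]);
            head_.store((head + 1) % N, std::memory_order_release);
            return value;
        }
    };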
Hmm... why such a complex solution to a common problem? There are plenty of producer-consumer queue classes out there - pick a reference/pointer/int-sized one that works (i.e. no data copying).
Get bytes from your stream until you have assembled a 'packet object' according to your protocol. Shove the packet object reference onto the queue and immediately new() another one for the next packet. The packet processor consumes packet objects from the queue.
The queue is only locked for the time taken to push/pop an object reference to/from the queue. The packet object being assembled by the socket/stream callback and the packet object being processed by the packet processor are always different, so no locking required.
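A sketch of that arrangement, where the lock guards only the pointer push/pop and never the assembly or processing of a packet (Packet and the class name are placeholders):

    #include <condition_variable>
    #include <mutex>
    #include <queue>

    struct Packet;  // assembled by the stream thread, consumed by the processor

    class PacketQueue {
        std::mutex mutex_;
        std::condition_variable ready_;
        std::queue<Packet*> queue_;

    public:
        void push(Packet* p) {
            { std::lock_guard lock(mutex_); queue_.push(p); }
            ready_.notify_one();
        }

        Packet* pop() {  // blocks until a packet (or nullptr sentinel) arrives
            std::unique_lock lock(mutex_);
            ready_.wait(lock, [this] { return !queue_.empty(); });
            Packet* p = queue_.front();
            queue_.pop();
            return p;
        }
    };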
Trying to operate on objects that are already in the queue, (or queue-like linked-list), sounds like a nightmare to me, (and the other posters seem to agree). Is there some other reason why you need to do this?
Rgds,
Martin

Is checking current thread inside a function ok?

Is it ok to check the current thread inside a function?
For example if some non-thread safe data structure is only altered by one thread, and there is a function which is called by multiple threads, it would be useful to have separate code paths depending on the current thread. If the current thread is the one that alters the data structure, it is ok to alter the data structure directly in the function. However, if the current thread is some other thread, the actual altering would have to be delayed, so that it is performed when it is safe to perform the operation.
Or, would it be better to use some boolean which is given as a parameter to the function to separate the different code paths?
Or do something totally different?
What do you think?
You are not making all that much sense. You say a non-thread-safe data structure is only ever altered by one thread, but in the next sentence you talk about delaying changes made to that data structure by other threads. Make up your mind.
In general, I'd suggest wrapping the access to the data structure up with a critical section, or mutex.
It's possible to use such animals as reader/writer locks to differentiate between readers and writers of data structures, but for typical cases the performance advantage usually won't merit the additional complexity associated with their use.
From the way your question is stated, I'm guessing you're fairly new to multithreaded development. I highly suggest sticking with the simplest and most commonly used approaches for ensuring data integrity (most books/articles you read on the issue will mention the same uses for mutexes/critical sections). Multithreaded development is extremely easy to get wrong and can be difficult to debug. Also, what seems like the "optimal" solution very often doesn't buy you the huge performance benefit you might expect. It's usually best to implement the simplest approach that will work, then worry about optimizing after the fact.
There is a trick that could work if, as you said, the other threads only make changes once in a while, although it is still rather hackish:
make sure your "master" thread can't be interrupted by the other ones (higher priority, non-fair scheduling)
check your thread
if it's the "master", just make the change
if it's another thread, hold off scheduling (if needed, by disabling interrupts), make the change, then reinstate scheduling
test thoroughly to make sure there are no issues in your setup.
As you can see, if requirements change a little bit, this could turn out worse than using normal locks.
As mentioned, the simplest solution when two threads need access to the same data is to use some synchronization mechanism (e.g. a critical section or mutex).
If you already have synchronization in your design try to reuse it (if possible) instead of adding more. For example, if the main thread receives its work from a synchronized queue you might be able to have thread 2 queue the data structure update. The main thread will pick up the request and can update it without additional synchronization.
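A sketch of that idea, with a hypothetical WorkQueue standing in for whatever synchronized queue the main thread already consumes: other threads enqueue a closure instead of touching the structure, and the owning thread applies it at a point where no one else is reading.

    #include <functional>
    #include <mutex>
    #include <queue>

    class WorkQueue {
        std::mutex mutex_;
        std::queue<std::function<void()>> work_;

    public:
        void post(std::function<void()> fn) {
            std::lock_guard lock(mutex_);
            work_.push(std::move(fn));
        }

        void drain() {  // called only by the owning (main) thread
            std::queue<std::function<void()>> batch;
            { std::lock_guard lock(mutex_); batch.swap(work_); }
            while (!batch.empty()) { batch.front()(); batch.pop(); }
        }
    };

    // Another thread requests a change instead of making it directly:
    //   queue.post([&structure] { structure.insert(42); });
    // The main thread calls queue.drain() in its loop and applies it safely.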
The queuing concept can be hidden from the rest of the design through the Active Object pattern. The active object may also be able to publish the data structure changes to other interested threads through the Observer pattern.