Multithreaded linked list in C++

First a little explanation of what I am trying to do:
My plan is to write a program with a socket stream implemented using the boost::asio library, which feeds data to a parser implemented using boost::spirit::qi. The parser will take the packets, fill a packet object, and then append that object to the end of a linked list of packet objects. The packet processor will read the first object in the list, do its processing, then move on to the next item and delete the first.
I decided to use a linked list because, if I used a std::queue, I would have to lock the entire container every time the stream added a packet or the processor removed one, which would make the two threads run more or less serially, something I would like to avoid. The queue class also has a tendency to copy entire objects, whereas the linked-list idea has the benefit of creating the object once and then just pointing to it.
To avoid serializing this whole business I intend to place a boost::mutex in each node and lock at that level. The idea is to have the socket stream create the list and immediately lock the first node, populate the node from the parser, create the next node and lock it, then unlock the first node and move on to the next node to do its work. This way there is never an unlocked node dangling at the end that the packet processor might jump to and delete under the socket stream's nose. The packet processor will check the first node and try to lock it; if it succeeds it will do its processing, unlock it, get the next node, and then delete the first node. This way serialization is limited to those times when the packet processor has caught up to the socket stream.
So now my question is, before I do the work of actually implementing this: does this sound like a good idea? I've tried it on a trivial test and it seems to work all right, and I can't think of any serious issues with it as long as I implement exception handling and take care to free any memory I allocate. But if anyone can think of any problems with this idea that I've overlooked, I would appreciate the input. I would also appreciate any other suggestions, either as an alternative or to make this idea work better.
Thanks in advance!

Check this article; it's about multiple consumers, but still brilliant:
Measuring Parallel Performance: Optimizing a Concurrent Queue

This implementation is screaming three things at me:
Way too easy to get deadlock, because insert and delete need to lock multiple mutexes at once, and you can't do that atomically. Well, you can, but you would need to put mutexes around your mutexes.
Way too easy to get corruption. The problems that lead to deadlock can also lead to corruption.
And slow, slow, slow. Think of your poor list walker: each step involves an unlock, a lock, another unlock, and another lock. It is going to have to be very careful in stepping, and man, will it be expensive. Lock and unlock each item, and do so in the correct order? Ouch.
This looks like a case of premature optimization and premature pessimization operating in parallel. What drove this architecture?
I suggest starting with the simple solution first. Lock the whole thing every time you want to touch any part of it. See if that gets you into trouble. If it doesn't, problem solved. If it does, the next step up is to switch from mutexes to read-write locks. Just one, not a gazillion. Now you have to think a bit about whether you want a shared or exclusive lock. If you have been diligent about const-correctness, use a shared lock (read lock) for your const methods and an exclusive lock (write lock) for your non-const methods.
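As a rough sketch of that last suggestion (the PacketList and Packet names are placeholders, not from the question), a coarse-grained container using C++17's std::shared_mutex could take a shared lock in const methods and an exclusive lock in non-const ones:

```cpp
#include <cstddef>
#include <list>
#include <mutex>
#include <shared_mutex>

struct Packet { /* parsed fields go here (illustrative type) */ };

class PacketList {
public:
    // Non-const method: exclusive (write) lock.
    void push_back(Packet p) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        items_.push_back(std::move(p));
    }

    // Const method: shared (read) lock, many readers may hold it at once.
    std::size_t size() const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return items_.size();
    }

private:
    mutable std::shared_mutex mutex_;  // mutable so const methods can lock it
    std::list<Packet> items_;
};
```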

I don't think what you suggest will work. Remember that when you delete a node from a linked list, you need to update the other nodes that point to the deleted node. Similarly, when you add a node you also need to update other nodes to point at the new node. So just locking the node being deleted or added isn't enough.
There are lock-free queues, but they are fiendishly difficult to get correct. For instance, look at the initial comments to the article here describing the extra work required to get a published algorithm to work.

Even though you are calling this a linked list, this is, in effect, a queue.
It is possible to implement Single Producer Single Consumer lock-free queues, if you are willing to use a fixed-size buffer. This lets you control the memory usage at the cost of making the Producer wait if the Consumer is not fast enough.
Apart from this slight point, your design looks fine; it will probably be easier to get it right than the lock-free alternative.
Do remember to have a termination condition (a null pointer in the next field, for example) so that the Producer can signal to the Consumer that there is nothing more to process.
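For reference, a minimal single-producer/single-consumer ring buffer along those lines might look like the following sketch (the SpscQueue name and the retry-on-full policy are my own choices for illustration):

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t Capacity>
class SpscQueue {
public:
    // Called only from the producer thread.
    bool try_push(T value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire))
            return false;                       // full: producer must wait/retry
        buffer_[head] = std::move(value);
        head_.store(next, std::memory_order_release);
        return true;
    }

    // Called only from the consumer thread.
    std::optional<T> try_pop() {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return std::nullopt;                // empty: nothing to process yet
        T value = std::move(buffer_[tail]);
        tail_.store((tail + 1) % Capacity, std::memory_order_release);
        return value;
    }

private:
    std::array<T, Capacity> buffer_{};
    std::atomic<std::size_t> head_{0};  // next slot to write (only the producer advances this)
    std::atomic<std::size_t> tail_{0};  // next slot to read (only the consumer advances this)
};
```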

Hmm... why such a complex solution to a common problem? There are plenty of producer-consumer queue classes out there - pick a reference/pointer/int-sized one that works (i.e. no data copying).
Get bytes from your stream until you have assembled a 'packet object' according to your protocol. Shove the packet object reference onto the queue and immediately new() another one for the next packet. The packet processor consumes packet objects from the queue.
The queue is only locked for the time taken to push/pop an object reference to/from the queue. The packet object being assembled by the socket/stream callback and the packet object being processed by the packet processor are always different, so no locking required.
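A minimal sketch of that flow, assuming a hypothetical Packet type and a plain std::queue of owning pointers so that only a pointer ever crosses the lock:

```cpp
#include <memory>
#include <mutex>
#include <queue>

struct Packet { /* fields assembled from the protocol (illustrative type) */ };

std::queue<std::unique_ptr<Packet>> packet_queue;
std::mutex queue_mutex;

// Stream/parser side: hand off the finished packet, then new() a fresh one.
void push_packet(std::unique_ptr<Packet> finished) {
    std::lock_guard<std::mutex> lock(queue_mutex);  // locked only for the push
    packet_queue.push(std::move(finished));
}

// Packet-processor side: take ownership of the next packet, or get nullptr.
std::unique_ptr<Packet> try_pop_packet() {
    std::lock_guard<std::mutex> lock(queue_mutex);  // locked only for the pop
    if (packet_queue.empty())
        return nullptr;
    auto packet = std::move(packet_queue.front());
    packet_queue.pop();
    return packet;
}
```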
Trying to operate on objects that are already in the queue (or queue-like linked list) sounds like a nightmare to me (and the other posters seem to agree). Is there some other reason why you need to do this?
Rgds,
Martin

Related

What is the most efficient way to coordinate between threads about which threads are free?

I'm working on a program that sends messages between threads. It looks at which threads are busy; if one is free it grabs the first free one (or in some cases multiple free ones), marks it as taken, sends work to it, and does its own work, then once finished waits for it to complete. The bottleneck in all of this is coordinating between threads about which thread is taken. This seems like a problem others have surely encountered, so I have some solutions to share, but I also want to know if you can do better than me.
My solution ultimately boils down to:
Maintain a set representing the indexes of free threads, and be able to grab an item from the set (getting the index of a free thread) or add one back to the set (increasing its size by one). Order is unimportant. I know the fixed size of the set in advance.
I've tried a few ways of doing this:
Maintain a single unsigned long long int and use __builtin_clzll (interestingly, __builtin_ffsll was 10x slower; I suspect it isn't supported as a single instruction on my processor) to find a free bit in a single instruction cycle and grab the lowest one, using a lookup table of bitmasks to flip bits on and off and thereby claim a thread number (a sketch of this approach appears after the comparison below). I loved this version because I only needed to share a single atomic unsigned long long and could use a single atomic operation, but doing fetch_and in a loop until you succeed ended up being slower than locking and updating non-atomically. The version using locking ended up being faster, probably because threads didn't get stuck in loops repeating the same operations while waiting for others to finish theirs.
Use a linked list: allocate all nodes in advance and maintain a head pointer; if it points to nullptr, we've reached the end of the list. I have only done this with a lock because it needs two simultaneous operations.
Maintain an array that represents all indexes of threads to claim. Either increment an array index and return previous pointer to claim a thread, or swap the last taken thread with the one being freed and decrement the pointer. Check if free.
Use the moodycamel queue, which is a lock-free queue.
Happy to share C++ code, the answer was getting to be quite long though when I tried to include it.
All three are fast. __builtin_clzll is not universally supported, so even though it is a little faster, it is probably not enough to be worth it, and it is probably 10x slower on computers that don't natively support it, similar to how __builtin_ffsll was slow. The array and the linked list are roughly as fast as each other; the array seems slightly faster when there is no contention. Moody is 3x slower.
Think you can do better and have a faster way to do this? This is still the slowest part of the process, and in some cases it is only barely worth the cost.
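For what it's worth, here is a rough sketch of the bitmask idea from the first bullet, expressed with a compare-exchange loop and C++20's std::countr_zero rather than fetch_and and a lookup table (those substitutions, and the function names, are mine, not from the benchmarked code):

```cpp
#include <atomic>
#include <bit>
#include <cstdint>

std::atomic<std::uint64_t> free_mask{~0ull};  // bit i set => thread i is free

// Claim the lowest-numbered free thread; returns its index, or -1 if none is free.
int claim_thread() {
    std::uint64_t mask = free_mask.load(std::memory_order_relaxed);
    while (mask != 0) {
        const int index = std::countr_zero(mask);            // lowest set bit
        const std::uint64_t claimed = mask & ~(1ull << index);
        if (free_mask.compare_exchange_weak(mask, claimed,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed))
            return index;
        // on failure, 'mask' was reloaded with the current value; try again
    }
    return -1;
}

// Mark a thread as free again.
void release_thread(int index) {
    free_mask.fetch_or(1ull << index, std::memory_order_release);
}
```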
Thoughts for directions to explore:
It feels like there should be a way using a couple of atomics, maybe an array of atomics handled one at a time. The integrity of the set has to be maintained with every operation, though, which makes this tricky. Most solutions at some point need two operations to be done simultaneously; in my benchmarking, atomics seem like they could provide a significantly faster solution than locking.
Might be able to use a lock but remove the need to check whether the list is empty or to swap elements in the array.
Maybe use a different data structure, for example, two arrays, add to one while emptying the other, then switch which one is being filled and which is emptied. This means no need to swap elements but rather just swap two pointers to arrays and only when one is empty.
Could have threads launching threads add work to a list of work to be done, then another thread can grab it while this thread keeps going. Ultimately still need a similar thread safe set.
See if the brilliant people on stackoverflow see directions to explore that I haven't seen yet :)
All you need is a thread pool, a queue (a list, deque or a ring buffer), a mutex and a condition_variable to signal when a new work item has been added to the queue.
Wrap work items in packaged_task if you need to wait on the result of each task.
When adding a new work item to the queue: 1) lock the mutex, 2) add the item, 3) release the lock, and 4) call notify_one on the condition_variable, which will unblock the first available thread.
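A minimal sketch of that submit/worker pair, using plain std::function tasks instead of packaged_task and a simple stop flag for shutdown (both are assumptions for the example):

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

std::deque<std::function<void()>> tasks;
std::mutex m;
std::condition_variable cv;
bool stopping = false;

// Producer side: steps 1-4 from above.
void submit(std::function<void()> task) {
    {
        std::lock_guard<std::mutex> lock(m);   // 1) lock, 2) add
        tasks.push_back(std::move(task));
    }                                           // 3) release
    cv.notify_one();                            // 4) wake one worker
}

// Each pool thread runs this loop.
void worker() {
    for (;;) {
        std::function<void()> task;
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return stopping || !tasks.empty(); });
            if (stopping && tasks.empty())
                return;
            task = std::move(tasks.front());
            tasks.pop_front();
        }
        task();  // run the work item outside the lock
    }
}
```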
Once the basic setup is working, if tasks are too fine-grained, work stealing can be added to the solution to improve performance. It's also important to use the main thread to perform part of the work instead of just waiting for all tasks to complete. These simple (albeit clunky) optimizations often result in >15% overall improvement in performance due to reduced context switching.
Also don't forget to think about false sharing. It might be a good idea to pad task items to 64 bytes, just to be on the safe side.

How can we use multi-threading while working with linked lists

I am fairly new to the concept of multi-threading and was exploring to see some interesting problems in order to get a better idea.
One of my friends suggested the following:
"It is fairly straight-forward to have a linked-list and do the regular insert, search and delete operations. But how would you do these operations if multiple threads need to work on the same list.
How many locks are required at a minimum? How many locks can we use to get optimized linked-list functions?"
Giving it some thought, I feel that a single lock should be sufficient. We acquire the lock for every single read and write operation. By this I mean that when we are accessing a node's data in the list we acquire the lock, and when we are inserting/deleting elements we acquire the lock for the complete series of steps.
But I was not able to think of a way where using more locks will give us more optimized performance.
Any help/pointers?
The logical extension of "one lock per list" would be "one lock per item".
The case when this would be useful would e.g. be if you're often only modifying a single item of the list.
For deletion and insertion, acquiring the proper locks gets more complicated, though. You'd have to acquire the locks for the items before and after, and you'd have to make sure to always acquire them in the same order (to prevent deadlocks). There are of course also special cases to be considered if the root element has to be modified (and possibly also if it's a doubly-linked list or a circular linked list). The overhead resulting from the more complicated locking logic might make your implementation slower again, especially if you often have to insert into and delete from the list.
So I would only consider this if the majority of accesses is the modification of a single node.
If you're searching for peak performance for a specific use case, then in the end, it boils down to implementing both, and running performance comparisons for a typical scenario.
You definitely need at least one semaphore/lock to ensure list integrity.
But presumably any operation on the list changes at most two nodes: the node being inserted/changed/deleted and the adjacent node which points to it. So you could implement locking on a per-node basis, locking at most two nodes for a given operation. This would allow for a degree of concurrency when different threads access the list, though I think you'd need to distinguish between read and write locks to get the full benefit of this approach.
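To make the per-node idea concrete, here is a sketch of the usual hand-over-hand (lock coupling) traversal, holding at most two node mutexes at a time and always locking in list order. It uses plain std::mutex rather than reader/writer locks for brevity, and assumes a dummy head (sentinel) node; the Node type is illustrative:

```cpp
#include <mutex>

struct Node {
    int value;
    Node* next = nullptr;
    std::mutex m;
};

// Remove the first node after the sentinel 'head' whose value matches.
bool remove(Node* head, int value) {
    Node* prev = head;
    prev->m.lock();
    Node* curr = prev->next;
    while (curr) {
        curr->m.lock();                 // lock coupling: take the next lock before releasing the previous one
        if (curr->value == value) {
            prev->next = curr->next;    // unlink while both locks are held
            curr->m.unlock();
            prev->m.unlock();
            delete curr;
            return true;
        }
        prev->m.unlock();               // release the older lock, keep the newer one
        prev = curr;
        curr = curr->next;
    }
    prev->m.unlock();
    return false;
}
```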
If you're new to multi-threading, embrace the notion that premature optimization is a waste of time. A linked list is a very straightforward data structure, and you can make it thread-safe by putting a critical section around all reads and writes. These will lock the thread into the CPU for the duration of the read/insert/delete operation and ensure thread safety. They also don't carry the overhead of a mutex lock or a more complicated locking mechanism.
If you want to optimize after the fact, only do so with a valid profiling tool that gives you raw numbers. The linked list operations will never end up being the biggest source of application slowdown, and it will probably never be worth your while to add in the node-level locking being discussed.
Using one lock for the entire list would completely defeat most reasons for multithreading in the first place. By locking the entire list down, you guarantee that only one thread can use the list at a time.
This is certainly safe in the sense that you will have no deadlocks or races, but it is naive and inefficient because you serialize access to the entire list.
A better approach would be to have a lock for each item in the list, and another one for the list itself. The latter would be needed when appending to the list, depending on how the list is implemented (e.g., if it maintains a node count separate from the nodes themselves).
However, this might also be less than optimal, depending on a number of factors. For instance, on some platforms mutexes might be expensive in terms of resources and time to instantiate. If space is at a premium, another approach might be to have a fixed-size pool of mutexes from which you draw whenever you need to access an item. These mutexes would have some kind of ownership flag indicating which node they are allocated to, so that no other mutex would be allocated to that node at the same time.
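One common way to realize such a fixed-size pool without per-mutex ownership flags is lock striping, where the node's address is hashed onto a small array of mutexes; a minimal sketch (the stripe count of 16 and the LockPool name are arbitrary choices for illustration):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <mutex>

class LockPool {
public:
    // Returns the mutex responsible for this node's address.
    std::mutex& lock_for(const void* node) {
        const auto h = reinterpret_cast<std::uintptr_t>(node);
        // Divide away alignment bits so neighbouring nodes spread across stripes.
        return stripes_[(h / alignof(std::max_align_t)) % stripes_.size()];
    }

private:
    std::array<std::mutex, 16> stripes_;  // far fewer mutexes than nodes
};

// Usage: LockPool pool;  std::lock_guard<std::mutex> g(pool.lock_for(node));
```

If an operation needs to hold the stripes of two different nodes at once, they still have to be acquired in a consistent order (e.g. by stripe index) to avoid deadlock.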
Another technique is to use reader/writer locks, which allow read access to any number of threads but write access to only one, the two being mutually exclusive. However, it has been suggested in the literature that in many cases using a reader/writer lock is actually less efficient than simply using a plain mutex. This will depend on your actual usage pattern and how the lock is implemented.
You only need to lock when you're writing and you say there's usually only one write, so try read/write locks.

concurrent garbage collection for a c++ graph data structure

I have a directed graph data structure used for audio signal processing (see http://audulus.com if you're curious).
I would like graph edges to be strong references, so in the absence of cycles, std::shared_ptr would do the trick. Alas, there are potentially cycles in the graph.
So, I had this idea for a simple concurrent mark-sweep collector:
The mutator thread sends events to the collector thread. The collector thread maintains its own representation of the graph and does not traverse the mutator thread's graph. The collector thread just uses mark-sweep at regular intervals.
The events would be the following (in function call form):
AddRoot(Node*)
RemoveRoot(Node*)
AddEdge(Node*, Node*)
RemoveEdge(Node*, Node*)
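To illustrate, the events might be represented roughly as below and pushed onto a lock-free queue (boost::lockfree::queue, mentioned further down) for the collector thread to drain. The types, names, and queue capacity here are illustrative, not the actual collector code:

```cpp
#include <boost/lockfree/queue.hpp>

struct Node;  // graph node type (opaque here)

enum class EventKind { AddRoot, RemoveRoot, AddEdge, RemoveEdge };

struct GcEvent {
    EventKind kind;
    Node* a;   // the root, or the edge's source
    Node* b;   // the edge's target (unused for root events)
};

// Shared by the mutator(s) and the collector thread; capacity chosen arbitrarily.
boost::lockfree::queue<GcEvent> gc_events(1024);

void AddEdge(Node* from, Node* to) {
    // The mutator only records the change; the collector replays it into its
    // own representation of the graph before the next mark-sweep pass.
    gc_events.push(GcEvent{EventKind::AddEdge, from, to});
}
```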
Is this scheme correct? The collector thread has an older version of what the mutator thread sees. My intuition is that since a node that is unreachable at an earlier time will still be unreachable at a later time, the collector thread may delete an unreachable object as soon as it finds one.
Also, if it's correct for one mutator thread, would it work for multiple mutator threads?
UPDATE
I've released the code here: https://github.com/audulus/collector. The code is actually fairly general purpose. Use RootPtr<T> to automatically keep track of root nodes. Links between nodes are managed using EdgePtr<T>.
The collector seems to work for multiple mutator threads (both in my app and in unit tests), but I feel like a proof of correctness is needed.
PLEASE NOTE (in response to @AaronGolden's comment below; judging from the comments, people aren't reading this): the mutator thread is responsible for calling the collector functions in the correct order. For example, if the mutator thread calls RemoveEdge(a,b) before assigning b to a RootPtr, the collector may intervene and collect b.
UPDATE 2:
I've updated the code to my latest version and updated the link above. I've now used the code in my app for over a year and haven't attributed any bugs to it.
One argument I think is somewhat persuasive (though I would hesitate to call it proof) that this scheme works is that in the absence of cycles, the scheme is equivalent to reference counting with atomic reference counts.
In the absence of cycles, AddRoot and AddEdge map to incrementing a reference count, and RemoveRoot and RemoveEdge map to decrementing one. Pushing an event onto the queue (I use boost::lockfree::queue) is an atomic operation, just like updating reference counts.
So then the remaining question is: how do cycles change the picture in terms of correctness? To wave hands a bit, cycles are a property of the connectivity of the graph, but don't have an effect on the atomicity of the operations or the ability of one thread to know something earlier than it would otherwise (causing a potential bad ordering of operations).
This would suggest that if there's a counterexample for the scheme, it will involve playing some game with cycles.
Is this scheme correct?
I'm concerned that you don't have any concept of safe points. In particular, can an update require more than one of your actions to be executed atomically? Perhaps it is ok because you can always add all vertices and edges in a batch before removing.
Also, if it's correct for one mutator thread, would it work for multiple mutator threads?
If one thread drops a root to a subgraph just after another picks up a root to the same subgraph then you must make sure you get the messages in-order which means you cannot use per-mutator queues. And a global queue is likely to kill scalability.
One of my constraints is that the GC has to be lock-free because of my real-time DSP thread
Is the allocator lock-free?
What if the GC cannot keep up with mutator(s)?
Also, I would recommend considering:
Forking and collecting in a child process.
Incremental mark-sweep with Dijkstra's tricolor marking scheme.
Baker's treadmill.
VCGC.
Separate per-thread heaps with deep copying of messages.

what's the advantage of message queue over shared data in thread communication?

I read an article about multithreaded program design, http://drdobbs.com/architecture-and-design/215900465; it says it's a best practice to "replace shared data with asynchronous messages. As much as possible, prefer to keep each thread's data isolated (unshared), and let threads instead communicate via asynchronous messages that pass copies of data".
What confuses me is that I don't see the difference between using shared data and using message queues. I am now working on a non-GUI project on Windows, so let's use Windows message queues, and take the traditional producer-consumer problem as an example.
Using shared data, there would be a shared container and a lock guarding the container between the producer thread and the consumer thread. When the producer outputs a product, it first waits for the lock, then writes something to the container, then releases the lock.
Using a message queue, the producer can simply call PostThreadMessage without blocking, and this is the advantage of asynchronous messages. But I think there must be some lock guarding the message queue between the two threads, otherwise the data would definitely get corrupted; the PostThreadMessage call just hides the details. I don't know whether my guess is right, but if it is, the advantage no longer seems to exist, since both methods do the same thing and the only difference is that the system hides the details when using message queues.
P.S. Maybe the message queue uses a non-blocking container, but I could use a concurrent container in the former approach too. I want to know how the message queue is implemented, and whether there is any performance difference between the two ways.
updated:
I still don't get the concept of asynchronous messages if the message queue operations are still blocked somewhere else. Correct me if my guess is wrong: when we use shared containers and locks, we block in our own thread; but when using message queues, my own thread returns immediately and leaves the blocking work to some system thread.
Message passing is useful for exchanging smaller amounts of data, because no conflicts need be avoided. It is much easier to implement than shared memory for inter-computer communication. Also, as you've already noticed, message passing has the advantage that application developers don't need to worry about the details of protections like shared memory.
Shared memory allows maximum speed and convenience of communication, as it can be done at memory speeds within a computer. Shared memory is usually faster than message passing, as message passing is typically implemented using system calls and thus requires the more time-consuming task of kernel intervention. In contrast, in shared-memory systems, system calls are required only to establish the shared-memory regions. Once established, all accesses are treated as normal memory accesses with no extra assistance from the kernel.
Edit: One case in which you might want to implement your own queue is when there are lots of messages to be produced and consumed, e.g., a logging system. With the implementation of PostThreadMessage, the queue capacity is fixed. Messages will most likely get lost if that capacity is exceeded.
Imagine you have 1 thread producing data, and 4 threads processing that data (presumably to make use of a multi-core machine). If you have a big global pool of data you are likely to have to lock it when any of the threads needs access, potentially blocking 3 other threads. As you add more processing threads you increase the chance of a lock having to wait and increase how many things might have to wait. Eventually adding more threads achieves nothing because all you do is spend more time blocking.
If instead you have one thread sending messages into message queues, one for each consumer thread, then the consumers can't block each other. You still have to lock the queue between the producer and consumer threads, but as you have a separate queue for each thread you have a separate lock, and no thread can block all the others while waiting for data.
If you suddenly get a 32 core machine you can add 20 more processing threads (and queues) and expect that performance will scale fairly linearly unlike the first case where the new threads will just run into each other all the time.
I have used a shared memory model where the pointers to the shared memory are managed in a message queue with careful locking. In a sense, this is a hybrid between a message queue and shared memory. This is very useful when large quantities of data must be passed between threads while retaining the safety of the message queue.
The entire queue can be packaged in a single C++ class with appropriate locking and the like. The key is that the queue owns the shared storage and takes care of the locking. Producers acquire a lock for input to the queue and receive a pointer to the next available storage chunk (usually an object of some sort), populate it, and release it. The consumer blocks until the next shared object has been released by the producer. It can then acquire a lock on the storage, process the data, and release it back to the pool. A suitably designed queue can perform multiple-producer/multiple-consumer operations with great efficiency. Think Java thread-safe (java.util.concurrent.BlockingQueue) semantics, but for pointers to storage.
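A rough sketch of that hybrid, where the queue owns a fixed pool of storage chunks and only pointers are handed back and forth (all names here are illustrative, not a particular library):

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

struct Chunk { std::vector<char> bytes; };   // a reusable block of shared storage

class ChunkQueue {
public:
    explicit ChunkQueue(std::size_t count) : storage_(count) {
        for (auto& c : storage_) free_.push_back(&c);   // the queue owns all storage
    }

    // Producer: borrow an empty chunk, fill it, then publish it.
    Chunk* acquire_free()         { return pop(free_); }
    void   publish(Chunk* filled) { push(filled_, filled); }

    // Consumer: block until a filled chunk is available, then recycle it afterwards.
    Chunk* acquire_filled()       { return pop(filled_); }
    void   recycle(Chunk* empty)  { push(free_, empty); }

private:
    Chunk* pop(std::deque<Chunk*>& q) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return !q.empty(); });
        Chunk* c = q.front();
        q.pop_front();
        return c;
    }
    void push(std::deque<Chunk*>& q, Chunk* c) {
        { std::lock_guard<std::mutex> lock(m_); q.push_back(c); }
        cv_.notify_all();   // waiters may be waiting on either the free or the filled list
    }

    std::vector<Chunk> storage_;       // never resized, so the pointers stay valid
    std::deque<Chunk*> free_, filled_;
    std::mutex m_;
    std::condition_variable cv_;
};
```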
Of course there is "shared data" when you pass messages. After all, the message itself is some sort of data. However, the important distinction is when you pass a message, the consumer will receive a copy.
the PostThreadMessage call just hide the details
Yes, it does, but being a WINAPI call, you can be reasonably sure that it does it right.
I still don't get the concept of async message if the message queue operations are still blocked somewhere else.
The advantage is more safety. You have a locking mechanism that is systematically enforced when you are passing a message. You don't even need to think about it, you can't forget to lock. Given that multi-thread bugs are some of the nastiest ones (think of race conditions), this is very important. Message passing is a higher level of abstraction built on locks.
The disadvantage is that passing large amounts of data would probably be slow. In that case, you need to use shared memory.
For passing state (i.e. worker thread reporting progress to the GUI) the messages are the way to go.
It's quite simple (I'm amazed others wrote such length responses!):
Using a message queue system instead of 'raw' shared data means that you have to get the synchronization (locking/unlocking of resources) right only once, in a central place.
With a message-based system, you can think in higher terms of "messages" without having to worry about synchronization issues anymore. For what it's worth, it's perfectly possible that a message queue is implemented using shared data internally.
I think this is the key piece of info there: "As much as possible, prefer to keep each thread’s data isolated (unshared), and let threads instead communicate via asynchronous messages that pass copies of data". I.e. use producer-consumer :)
You can do your own message passing or use something provided by the OS. That's an implementation detail (needs to be done right ofc). The key is to avoid shared data, as in having the same region of memory modified by multiple threads. This can cause hard to find bugs, and even if the code is perfect it will eat performance because of all the locking.
I had exactly the same question. After reading the answers, I feel:
In most typical use cases, queue = async, shared memory (locks) = sync. Indeed, you can build an async version on top of shared memory, but that's more code, akin to reinventing the message-passing wheel.
Less code = fewer bugs and more time to focus on other stuff.
The pros and cons are already mentioned by previous answers so I will not repeat.

Is checking current thread inside a function ok?

Is it ok to check the current thread inside a function?
For example if some non-thread safe data structure is only altered by one thread, and there is a function which is called by multiple threads, it would be useful to have separate code paths depending on the current thread. If the current thread is the one that alters the data structure, it is ok to alter the data structure directly in the function. However, if the current thread is some other thread, the actual altering would have to be delayed, so that it is performed when it is safe to perform the operation.
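For illustration, the check itself can be done with std::this_thread::get_id(); a rough sketch of the idea described above, where the owner_id variable and the pending-changes queue are assumptions made up for the example:

```cpp
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

std::thread::id owner_id;                       // set once by the thread that owns the structure
std::vector<std::function<void()>> pending;     // changes deferred by other threads
std::mutex pending_mutex;

void alter_structure(std::function<void()> change) {
    if (std::this_thread::get_id() == owner_id) {
        change();                               // owner thread: apply directly
    } else {
        std::lock_guard<std::mutex> lock(pending_mutex);
        pending.push_back(std::move(change));   // other thread: defer until it is safe
    }
}

// Called periodically by the owner thread, at a point where applying changes is safe.
void apply_pending() {
    std::vector<std::function<void()>> batch;
    {
        std::lock_guard<std::mutex> lock(pending_mutex);
        batch.swap(pending);
    }
    for (auto& change : batch) change();
}
```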
Or, would it be better to use some boolean which is given as a parameter to the function to separate the different code paths?
Or do something totally different?
What do you think?
You are not making all that much sense. You said the non-thread-safe data structure is only ever altered by one thread, but in the next sentence you talk about delaying changes made to that data structure by other threads. Make up your mind.
In general, I'd suggest wrapping the access to the data structure up with a critical section, or mutex.
It's possible to use such animals as reader/writer locks to differentiate between readers and writers of data structures, but for typical cases the performance advantage usually won't merit the additional complexity associated with their use.
From the way your question is stated, I'm guessing you're fairly new to multithreaded development. I highly suggest sticking with the simplest and most commonly used approaches for ensuring data integrity (most books/articles you read on the issue will mention the same uses for mutexes/critical sections). Multithreaded development is extremely easy to get wrong and can be difficult to debug. Also, what seems like the "optimal" solution very often doesn't buy you the huge performance benefit you might expect. It's usually best to implement the simplest approach that will work, then worry about optimizing it after the fact.
There is a trick that could work in the case where, as you said, the other threads only make changes once in a while, although it is still rather hackish:
make sure your "master" thread can't be interrupted by the other ones (higher priority, non fair scheduling)
check your thread
if "master", just change
if another thread, disable scheduling (if needed, by disabling interrupts), make the change, then re-enable scheduling
really test to make sure there are no issues in your setup.
As you can see, if requirements change a little bit, this could turn out worse than using normal locks.
As mentioned, the simplest solution when two threads need access to the same data is to use some synchronization mechanism (i.e. critical section or mutex).
If you already have synchronization in your design try to reuse it (if possible) instead of adding more. For example, if the main thread receives its work from a synchronized queue you might be able to have thread 2 queue the data structure update. The main thread will pick up the request and can update it without additional synchronization.
The queueing concept can be hidden from the rest of the design through the Active Object pattern. The active object may also be able to publish the data structure changes to other interested threads through the Observer pattern.