c++ multithread slow processing

c++ multithread slow processing - c++

I have multiple consumer threads and one producer thread. Producer thread writes the data into a map belong to a certain consumer thread and sends a signal to the consumer thread. I am using mutexes around the map when I am inserting and erasing the data. however this approach looks not efficient in terms of speed performance. Can you suggest another approach instead of map which requires mutex locks and unlocks and I think mutex slows down the transmission.

however this approach looks not efficient in terms of speed performance. Can you suggest another approach instead of map which requires mutex locks and unlocks and I think mutex slows down the transmission.
You should use a profiler to identify where the bottleneck is.
Producer thread writes the data into a map belong to a certain consumer thread and sends a signal to the consumer thread.
The producer should not be concerned what kind of data structure the consumer uses - it is a consumer's implementation detail. Keep in mind that inserting a value into a map requires a memory allocation (unless you are using a custom allocator) and memory allocation internally takes locks as well to protect the state of the heap. The end result is that locking a mutex around map::insert operation may lock it for too long actually.
A simpler and more efficient design would be to have an atomic queue between the producer and consumer (e.g. pipe, TBB concurrent_bounded_queue which pre-allocates its storage so that push/pop operations are really quick). Since your producer communicates directly to each consumer that queue is one-writer-one-reader and it can be implemented as a wait-free queue (or ring buffer a-la C++ disruptor).

Andrei Alexandrescu made the good point in that you should measure your code (https://www.facebook.com/notes/facebook-engineering/three-optimization-tips-for-c/10151361643253920) and this is the same advice I would give you, which is to measure your code and see what performance differences you are getting between a baseline test and your test running single threaded:
Time required to insert data using single thread to map
with above listed data
Time required to insert data
using single thread to map with above listed data and using mutex
locks
If you are still looking for a thread-safe container, you may want to look at Intel's open-source implementation of thread-safe containers at http://www.threadingbuildingblocks.org/docs/help/reference/containers_overview/concurrent_queue_cls.htm .
Also, as a suggestion for the consumer thread implementation, you may want to read the ActiveObject article that Herb Sutter posted on his website: http://herbsutter.com/2010/07/12/effective-concurrency-prefer-using-active-objects-instead-of-naked-threads/
If you can provide some more details, like why the map has to be locked all the time, we may be able to draft up a mechanism that is better performing.

Related

Is there a awaitable queue in c++?

I use concurrency::task from ppltasks.h heavily in my codebase.
I would like to find a awaitable queue, where I can do "co_await my_queue.pop()". Has anyone implemented one?
Details:
I have one producer thread that pushes elements to a queue, and another receiver thread would be waiting and waking up when elements arrive in the queue. This receiving thread might wait/wake up to handle other tasks in the meantime (using pplpp::when_any).
I don't want a queue with an interface where i have to poll a try_pop method as that is slow, and I don't want a blocking_pop method as that means I can't handle other ready tasks in the meantime.

This is basically your standard thread-safe queue implementation, but instead of a condition_variable, you will have to use futures to coordinate the different threads. You can then co_await on the future returned by pop to become ready.
The queue's implementation will need to keep a list of the promises that correspond to the outstanding pop calls. In case that the queue is still full when poping, you can return a ready future immediately. You can use plain old std::mutex to synchronize concurrent access to the underlying data structures.
I don't know of any implementation that already does this, but it shouldn't be too hard to pull off. Note though that managing all the futures will introduce some additional overhead, so your queue will probably be slightly less efficient than the classic condition_variable-based approach.

Posted a comment but I might as well write this as the answer since its long an I need formatting.
Basically you're two options are:
Lock-free queues, the most popular of which is this:
https://github.com/cameron314/concurrentqueue
They do have try_pop, because it uses atomic pointer and any atomic methods (e.g. std::atomic_compare_exchange_weak) can and will "fail" and return false at times, so you are forced to have a spin-lock over them.
You may find queues that abstract this inside a "pop" which just calls "try_pop" until it works, but that's the same overhead in the backround.
Lock-base queues:
These are easier to do on your own, without a third part library, just wrap every method you need in locks, if you want to 'peek' very often look into using shared_locks, otherwise just std::lock_guard should be enough to guard all wrapper. However this is what you may call a 'blocking' queue since during an access, weather it is to read or to write, the whole queue will be locked.
There is not thread-safe alternatives to these two implementations. If you are in need of a really large queue (e.g. hundreds of GBs of memory worth of objects) under heavy usage you can consider writing some custom hybrid data structure, but for most usecases moodycamel's queue will be more than sufficient an.

thread safe producer/consumer pattern in c++

How Can I develop a producer/ consumer pattern which is thread safe?
in my case, the producer runs in a thread and the consumer runs on another thread.
Is std::deque is safe for this purpose?
can I push_back to the back of a deque in one thread and push_front in another thread?
Edit 1
In my case, I know the maximum number of items in the std::deque (for example 10). Is there any way that I can reserve enough space for items beforehand so during processing, there was no need to change the size of queue memory and hence make sure when I am adding pushing data to back, no change could be happen on front data?

STL C++ containers are not thread-safe: if you decide for them, you need to use proper synchronizations (basically std::mutex and std::lock) when pushing/popping elements.
Alternatively you can use properly designed containers (single-producer-single-consumer queues should fit your needs), one example of them here: http://www.boost.org/doc/libs/1_58_0/doc/html/lockfree.html
addon after your EDIT:
yep, a SPSC queue is basically a ring buffer and definitively fits you needs.

How Can I develop a producer/ consumer pattern which is thread safe?
There are several ways, but using locks and monitors is fairly easy to grasp and doesn't have many hidden caveats. The standard library has std::unique_lock, std::lock_guard and std::condition_variable to implement the pattern. Check out the cppreference page of condition_variable for simple example.
Is std::deque is safe for this purpose?
It's not safe. You need synchronization.
can I push_back to the back of a deque in one thread and push_front in another thread?
Sure, but you need synchronization. There is a race condition when the queue is empty or has only one element. Also when the queue is full or one short of full in case you want to limit it's size.

I think you mean push_back() and pop_front().
std::deque is not thread-safe on its own.
You will need to serialise access using an std::mutex so the consumer isn't trying to pop while the producer is trying to push.
You should also consider how you handle the following:
How does the consumer behave if the deque is empty when it looks for the next item?
If it enters a wait state then you will need a std::condition_variable to be notified by the producer when the deque has been added to.
You may also need to handle program termination in which the consumer is waiting on the deque and the program is terminated. It could be left 'waiting forever' unless you orchestrate things correctly.
10 items is 'piffle' so I wouldn't bother about reserving space. std::deque grows and shrinks automatically so don't bother with fine grain tuning until you've built a working application.
Premature optimization is the root of all evil.
NB: It's not clear how your limiting the queue size but if the producer fills up the queue and then waits for it to clear back down you'll need more waits and conditions coming back the other way to coordinate.

STL containers thread-safeness for producer/consumer pattern

I am planning to do the following:
store a deque of pre-built objects to be consumed. The main thread might consume these objects here and there. I have another junky thread used for logging and other not time-critical but expensive things. When the pre-built objects are running low, I will refill them in the junky thread.
Now my question is, is there going to be race condition here? Technically one thread is consuming objects from the front, and another thread is pushing objects into the back. As long as I don't let the size run down to zero, it should be fine. The only thing that concerns me is the "size" of this deque. Do they store a integer "size" variable in STL containers? should modifying that size variable introduce race conditions?
What's the best way of solving this problem? I don't really want to use locks, because the main thread is performance critical (the reason I pre-built these objects in the first place!)

STL containers are not thread safe, period, don't play with this. Specifically the deque elements are usually stored in a chain of short arrays and that chain will be modified when operating with the deque, so there's a lot of room for messing things up.

Another option would be to have 2 deques, one for read another for write. The main thread reads, and the other writes. When the read deque is empty, switch the deques (just move 2 pointers), which would involve a lock, but only occasionally.
The consumer thread would drive the switch so it would only need to do a lock when switching. The producer thread would need to lock per write in case the switch happens in the middle of a write, but as you mention the consumer is less performance-critical, so no worries there.
What you're suggesting regarding no locks is indeed dangerous as others mention.

As #sharptooth mentioned, STL containers aren't thread-safe. Are you using a C++11 capable compiler? If so, you could implement a lock-free queue using atomic types. Otherwise you'd need to use assembler for compare-and-swap or use a platform specific API (see here). See this question to get information on how to do this.
I would emphasise that you should measure performance when using standard thread synchronisation and see if you do actually need a lock-free technique.

There will be a data race even with non-empty deque.
You'll have to protect all accesses (not just writes) to the deque through locks, or use a queue specifically designed for consumer-producer model in multi-threaded environment (such as Microsoft's unbounded_buffer).

what's the advantage of message queue over shared data in thread communication?

I read a article about multithread program design http://drdobbs.com/architecture-and-design/215900465, it says it's a best practice that "replacing shared data with asynchronous messages. As much as possible, prefer to keep each thread’s data isolated (unshared), and let threads instead communicate via asynchronous messages that pass copies of data".
What confuse me is that I don't see the difference between using shared data and message queues. I am now working on a non-gui project on windows, so let's use windows's message queues. and take a tradition producer-consumer problem as a example.
Using shared data, there would be a shared container and a lock guarding the container between the producer thread and the consumer thread. when producer output product, it first wait for the lock and then write something to the container then release the lock.
Using message queue, the producer could simply PostThreadMessage without block. and this is the async message's advantage. but I think there must exist some lock guarding the message queue between the two threads, otherwise the data will definitely corrupt. the PostThreadMessage call just hide the details. I don't know whether my guess is right but if it's true, the advantage seems no longer exist,since both two method do the same thing and the only difference is that the system hide the details when using message queues.
ps. maybe the message queue use a non-blocking containner, but I could use a concurrent container in the former way too. I want to know how the message queue is implemented and is there any performance difference bwtween the two ways?
updated:
I still don't get the concept of async message if the message queue operations are still blocked somewhere else. Correct me if my guess was wrong: when we use shared containers and locks we will block in our own thread. but when using message queues, myself's thread returned immediately, and left the blocking work to some system thread.

Message passing is useful for exchanging smaller amounts of data, because no conflicts need be avoided. It's much easier to implement than is shared memory for intercomputer communication. Also, as you've already noticed, message passing has the advantage that application developers don't need to worry about the details of protections like shared memory.
Shared memory allows maximum speed and convenience of communication, as it can be done at memory speeds when within a computer. Shared memory is usually faster than message passing, as message-passing are typically implemented using system calls and thus require the more time-consuming tasks of kernel intervention. In contrast, in shared-memory systems, system calls are required only to establish shared-memory regions. Once established, all access are treated as normal memory accesses w/o extra assistance from the kernel.
Edit: One case that you might want implement your own queue is that there are lots of messages to be produced and consumed, e.g., a logging system. With the implemenetation of PostThreadMessage, its queue capacity is fixed. Messages will most liky get lost if that capacity is exceeded.

Imagine you have 1 thread producing data,and 4 threads processing that data (presumably to make use of a multi core machine). If you have a big global pool of data you are likely to have to lock it when any of the threads needs access, potentially blocking 3 other threads. As you add more processing threads you increase the chance of a lock having to wait and increase how many things might have to wait. Eventually adding more threads achieves nothing because all you do is spend more time blocking.
If instead you have one thread sending messages into message queues, one for each consumer thread then they can't block each other. You stil have to lock the queue between the producer and consumer threads but as you have a separate queue for each thread you have a separate lock and each thread can't block all the others waiting for data.
If you suddenly get a 32 core machine you can add 20 more processing threads (and queues) and expect that performance will scale fairly linearly unlike the first case where the new threads will just run into each other all the time.

I have used a shared memory model where the pointers to the shared memory are managed in a message queue with careful locking. In a sense, this is a hybrid between a message queue and shared memory. This is very when large quantities of data must be passed between threads while retaining the safety of the message queue.
The entire queue can be packaged in a single C++ class with appropriate locking and the like. The key is that the queue owns the shared storage and takes care of the locking. Producers acquire a lock for input to the queue and receive a pointer to the next available storage chunk (usually an object of some sort), populates it and releases it. The consumer will block until the next shared object has released by the producer. It can then acquire a lock to the storage, process the data and release it back to the pool. In A suitably designed queue can perform multiple producer/multiple consumer operations with great efficiency. Think a Java thread safe (java.util.concurrent.BlockingQueue) semantics but for pointers to storage.

Of course there is "shared data" when you pass messages. After all, the message itself is some sort of data. However, the important distinction is when you pass a message, the consumer will receive a copy.
the PostThreadMessage call just hide the details
Yes, it does, but being a WINAPI call, you can be reasonably sure that it does it right.
I still don't get the concept of async message if the message queue operations are still blocked somewhere else.
The advantage is more safety. You have a locking mechanism that is systematically enforced when you are passing a message. You don't even need to think about it, you can't forget to lock. Given that multi-thread bugs are some of the nastiest ones (think of race conditions), this is very important. Message passing is a higher level of abstraction built on locks.
The disadvantage is that passing large amounts of data would be probably slow. In that case, you need to use need shared memory.
For passing state (i.e. worker thread reporting progress to the GUI) the messages are the way to go.

It's quite simple (I'm amazed others wrote such length responses!):
Using a message queue system instead of 'raw' shared data means that you have to get the synchronization (locking/unlocking of resources) right only once, in a central place.
With a message-based system, you can think in higher terms of "messages" without having to worry about synchronization issues anymore. For what it's worth, it's perfectly possible that a message queue is implemented using shared data internally.

I think this is the key piece of info there: "As much as possible, prefer to keep each thread’s data isolated (unshared), and let threads instead communicate via asynchronous messages that pass copies of data". I.e. use producer-consumer :)
You can do your own message passing or use something provided by the OS. That's an implementation detail (needs to be done right ofc). The key is to avoid shared data, as in having the same region of memory modified by multiple threads. This can cause hard to find bugs, and even if the code is perfect it will eat performance because of all the locking.

I had exact the same question. After reading the answers. I feel:
in most typical use case, queue = async, shared memory (locks) = sync. Indeed, you can do a async version of shared memory, but that's more code, similar to reinvent the message passing wheel.
Less code = less bug and more time to focus on other stuff.
The pros and cons are already mentioned by previous answers so I will not repeat.

Fastest implementation of one thread providing data, many threads consuming data

I have a lot of data that I want to disseminate to many different threads. This data is coming from a single thread. The consuming threads can safely access the container simultaneously.
The data needs to be merged into the container ever delta seconds (50ms < delta < 1), during which time the consuming threads need to be locked out, but not blocked. Similarly, when the data producer wants to merge in the data, it should wait until any reading threads are finished (which should be fast), but no one else should start reading as the update needs to occur as soon as possible.
I'm working on linux (platform specific solution is perfectly fine/expected) and I care about every millisecond. What sort of locking mechanisms should I use or is there an even better model for this problem?

If there is only one data producer thread and memory is not a consideration, you may want to consider using a merge and swap algorithm.
In it, the writer thread creates a copy of the data structure while readers continue to use the original, merges in new changes, then performs an exchange of the two structures within a mutex or critical section (or reader/writer lock). If your Unix platform supports interlocked exchange as an atomic operation, you can perform a lock-free exchange maximizing read throughput through they implementation.

It looks like you need to use the pthread read/write locks. They allow you to restrict access to one writer OR multiple readers. Look at pthread_rwlock_init to initialize the lock, pthread_rwlock_rdlock to acquire the lock for reading data, and pthread_rwlock_wrlock to acquire the lock for writing data.

Sounds like a good use for pthread read-write locks along with some thread-safe queues. The producer thread inserts items into the queue. The worker pool will pull items off of the queue and process the data. I'm not sure how the output will work but you might want to use a thread-safe queue here as well... maybe a priority queue to automatically merge the data if it makes sense.
The locked queue construct is nothing more than a mutex for exclusive locking, a std::queue for data storage, and a condition variable to wake up threads that are waiting on the queue. The enqueue method grabs the lock, inserts into the queue, releases the lock, and signals the condition. The dequeue method grabs the mutex, waits on the condition using the mutex as a guard, and dequeues any data that is there when it is woken up. This is a pretty standard producer-consumer style queue.
Before you roll your own solution, you might want to check out Boost.MPI and Boost.Thread. They both provide nicer C++ interfaces over the underlying OS implementation. I've used Boost.Thread a lot but it doesn't provide a nice message passing interface, but it does improve over pthread.
If you are really into multi-processing, you might want to give Boost.MPI or maybe Apache Qpid serious consideration. I plan on looking into Qpid and AMPQ for future projects since they both provide nice message-based interfaces.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js