Having thread-local queues with counters - C++

I have four threads, each with its own private queue and a private 'int count' member. Whenever the program thread produces a task, it should be enqueued to the queue of the thread with the minimum 'int count'.
Whenever a task is pushed into a queue, that queue's private 'int count' should be increased by 1.
Whenever a task is popped from a queue, that queue's private 'int count' should be decreased by 1.
So the 'int count' changes dynamically with the push and pop operations, and the program thread dispatches each task to the queue with the lowest count (or the first zero count found).
This is the underlying logic of the program.
I am working in C++ with the Linux multithreading library, implementing a multi-rate synchronous data flow paradigm.
Could you please give me some coding ideas for implementing this logic? That is:
1. Initialize every private int queue counter to 0.
2. counter++ when a task is pushed.
3. counter-- when a task is popped.
4. The task dispatcher reads the private int count of each thread.
5. It dispatches the task to the queue with the minimum count.

OK, so basically your program thread is the producer and you have 4 consumer threads. By using a queue in each thread you will minimize the time the main thread spends interacting with the consumers. N.B. You need to consider whether your threads are going to be starved or overflow, i.e. whether the single producer will create "work" at a rate that warrants 4 consumers, or whether the 4 consumers will be swamped.
Naive approach
So you need to synchronize the queue access and the count increment, meaning that you need a mutex to stop the consumer accessing anything while the count and queue are modified. The easiest way to do the synchronization is to have a method (e.g. enqueue(Item& item)) which locks the mutex within it.
C++11 : Mutex http://en.cppreference.com/w/cpp/thread/mutex
Additionally, if starvation (or overflow) is an issue, you will need some signalling to pause the relevant threads (starved: stop the consumers to avoid wasting CPU; overflow: stop the producer while the consumers catch up). Usually these signals are implemented using condition variables.
C++11 : Condition variables : http://en.cppreference.com/w/cpp/thread/condition_variable
The situation is slightly complicated here, in that the threads you want to populate are the ones with the least work to do. This requires that you inspect the 4 counts and choose a queue. However, because there is only one producer, you can probably scan for the queue without taking the locks, provided each count is a std::atomic<int> (reading a plain int while another thread writes it is a data race in C++11). The logic here is that the consumers will not be affected by the read, and the choice of thread would not really be incorrect even if the consumers keep working during that choice.
So I would have an array of thread objects, each of which would have the count, and a mutex for locking.
Initialize the counts in the constructors. Make sure that the producer isn't working during initialization and synchronization won't be an issue.
Implement 2 methods on the thread object to do the enqueuing and dequeuing, and in each use a lock_guard to lock the mutex (RAII technique). Then push/pop the item to/from the queue and increment/decrement the count as applicable.
C++11: lock_guard http://en.cppreference.com/w/cpp/thread/lock_guard
As I said above, if there is only one producer you can simply scan through the array of objects and choose (maintain an index to) the thread object whose counter is the lowest (add a getCount() method). It will most likely still be the lowest even with the consumers continuing their work.
If there are multiple threads producing work then you might need to think about how you want to handle 2 threads enqueuing to the same thread (it might not matter).
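A minimal sketch of the approach described above, assuming C++11 and modelling a task as std::function<void()>. The names Worker, least_loaded and run_dispatch_demo are hypothetical, and the count is kept in a std::atomic<int> so that the single producer can scan it without taking each lock. Treat it as an illustration under those assumptions rather than a tuned implementation.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// One consumer thread with its own queue and counter (hypothetical class).
class Worker {
public:
    Worker() : thread_([this] { run(); }) {}

    ~Worker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_one();
        thread_.join();
    }

    void enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(task));
            ++count_;  // incremented on push
        }
        cv_.notify_one();
    }

    int count() const { return count_.load(); }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
                if (queue_.empty()) return;  // stop requested and nothing left
                task = std::move(queue_.front());
                queue_.pop();
                --count_;  // decremented on pop
            }
            task();  // run outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> queue_;
    std::atomic<int> count_{0};
    bool stop_ = false;
    std::thread thread_;  // declared last so the other members exist before it starts
};

// Scan without locking; the answer may be slightly stale, which is acceptable here.
template <std::size_t N>
std::size_t least_loaded(const std::array<Worker, N>& workers) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < N; ++i)
        if (workers[i].count() < workers[best].count()) best = i;
    return best;
}

int run_dispatch_demo() {
    std::atomic<int> done{0};
    {
        std::array<Worker, 4> workers;
        for (int i = 0; i < 100; ++i)
            workers[least_loaded(workers)].enqueue([&done] { ++done; });
    }  // the destructors drain each queue and join the threads
    return done.load();
}
```

In this sketch the destructor drains whatever is still queued before joining, which keeps the demo deterministic; a real system might prefer to discard pending tasks on shutdown.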


Where can we use std::barrier over std::latch?

I recently heard about two new C++ standard features, std::barrier and std::latch.
I cannot figure out in which situations each one is applicable and useful over the other.
If someone could give an example of how to use each of them wisely, it would be really helpful.
Very short answer
They're really aimed at quite different goals:
Barriers are useful when you have a bunch of threads and you want to synchronise across all of them at once, for example to do something that operates on all of their data at once.
Latches are useful if you have a bunch of work items and you want to know when they've all been handled, and aren't necessarily interested in which thread(s) handled them.
Much longer answer
Barriers and latches are often used when you have a pool of worker threads that do some processing and a queue of work items that is shared between them. It's not the only situation where they're used, but it is a very common one and does help illustrate the differences. Here's some example code that would set up some threads like this:
const size_t worker_count = 7; // or whatever
std::vector<std::thread> workers;
std::vector<Proc> procs(worker_count);
Queue<std::function<void(Proc&)>> queue;
for (size_t i = 0; i < worker_count; ++i) {
    workers.push_back(std::thread(
        [p = &procs[i], &queue]() {
            // Keep popping callbacks and run each one with this thread's Proc
            while (auto fn = queue.pop_back()) {
                fn(*p);
            }
        }
    ));
}
There are two types that I have assumed exist in that example:
Proc: a type specific to your application that contains data and logic necessary to process work items. A reference to one is passed to each callback function that's run in the thread pool.
Queue: a thread-safe blocking queue. There is nothing like this in the C++ standard library (somewhat surprisingly) but there are a lot of open-source libraries containing them e.g. Folly MPMCQueue or moodycamel::ConcurrentQueue, or you can build a less fancy one yourself with std::mutex, std::condition_variable and std::deque (there are many examples of how to do this if you Google for them).
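For illustration, the "less fancy" do-it-yourself queue mentioned above could look roughly like this. It is only a sketch: the close() method is my own hypothetical addition so that pop_back() can return an empty std::function and end the `while (auto fn = queue.pop_back())` loop, and the push_back/pop_back names simply mirror the example code, although items are handed out in FIFO order.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Minimal blocking queue built from std::mutex, std::condition_variable, std::deque.
template <typename T>
class Queue {
public:
    void push_back(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(std::move(item));
        }
        ready_.notify_one();
    }

    // Blocks until an item is available. Once the queue is closed and drained
    // it returns a default-constructed T (an empty std::function tests false).
    T pop_back() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return T{};
        T item = std::move(items_.front());  // oldest item first
        items_.pop_front();
        return item;
    }

    // Hypothetical shutdown hook: wakes every waiting consumer.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<T> items_;
    bool closed_ = false;
};

int run_queue_demo() {
    Queue<std::function<void(int&)>> queue;
    int total = 0;  // only touched by the worker until join() returns
    std::thread worker([&] {
        while (auto fn = queue.pop_back()) fn(total);
    });
    for (int i = 1; i <= 10; ++i)
        queue.push_back([i](int& sum) { sum += i; });
    queue.close();
    worker.join();
    return total;
}
```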
A latch is often used to wait until some work items you push onto the queue have all finished, typically so you can inspect the result.
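std::vector<WorkItem> work = get_work();
std::latch latch(work.size());
for (WorkItem& work_item : work) {
    queue.push_back([&work_item, &latch](Proc& proc) {
        proc.do_work(work_item); // do_work() stands for your own processing
        latch.count_down();
    });
}
latch.wait();
// Inspect the completed work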
How this works:
The threads will - eventually - pop the work items off of the queue, possibly with multiple threads in the pool handling different work items at the same time.
As each work item is finished, latch.count_down() is called, effectively decrementing an internal counter that started at work.size().
When all work items have finished, that counter reaches zero, at which point latch.wait() returns and the producer thread knows that the work items have all been processed.
The latch count is the number of work items that will be processed, not the number of worker threads.
The count_down() method could be called zero times, one time, or multiple times on each thread, and that number could be different for different threads. For example, even if you push 7 messages onto 7 threads, it might be that all 7 items are processed on the same thread (rather than one on each thread) and that's fine.
Other unrelated work items could be interleaved with these ones (e.g. because they were pushed onto the queue by other producer threads) and again that's fine.
In principle, it's possible that latch.wait() won't be called until after all of the worker threads have already finished processing all of the work items. (This is the sort of odd condition you need to look out for when writing threaded code.) But that's OK, it's not a race condition: latch.wait() will just immediately return in that case.
An alternative to using a latch is that there's another queue, in addition to the one shown here, that contains the result of the work items. The thread pool callback pushes results on to that queue while the producer thread pops results off of it. Basically, it goes in the opposite direction to the queue in this code. That's a perfectly valid strategy too, in fact if anything it's more common, but there are other situations where the latch is more useful.
A barrier is often used to make all threads wait simultaneously so that the data associated with all of the threads can be operated on simultaneously.
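typedef std::function<void()> Fn;
// Note: the standard requires the completion function to be nothrow-invocable,
// so some implementations may reject std::function here.
Fn completionFn = [&procs]() {
    // Do something with the whole vector of Proc objects
};
auto barrier = std::make_shared<std::barrier<Fn>>(worker_count, completionFn);
auto workerFn = [barrier](Proc&) {
    barrier->arrive_and_wait();
};
for (size_t i = 0; i < worker_count; ++i) {
    queue.push_back(workerFn);
}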
How this works:
All of the worker threads will pop one of these workerFn items off of the queue and call barrier.arrive_and_wait().
Once all of them are waiting, one of them will call completionFn() while the others continue to wait.
Once that function completes they will all return from arrive_and_wait() and be free to pop other, unrelated, work items from the queue.
Here the barrier count is the number of worker threads.
It is guaranteed that each thread will pop precisely one workerFn off of the queue and handle it. Once a thread has popped one off of the queue, it will wait in barrier.arrive_and_wait() until all the other copies of workerFn have been popped off by other threads, so there is no chance of it popping another one off.
I used a shared pointer to the barrier so that it will be destroyed automatically once all the work items are done. This wasn't an issue with the latch because there we could just make it a local variable in the producer thread function, because it waits until the worker threads have used the latch (it calls latch.wait()). Here the producer thread doesn't wait for the barrier so we need to manage the memory in a different way.
If you did want the original producer thread to wait until the barrier has been finished, that's fine, it can call arrive_and_wait() too, but you will obviously need to pass worker_count + 1 to the barrier's constructor. (And then you wouldn't need to use a shared pointer for the barrier.)
If other work items are being pushed onto the queue at the same time, that's fine too, although it will potentially waste time as some threads will just be sitting there waiting for the barrier to be acquired while other threads are distracted by other work before they acquire the barrier.
!!! DANGER !!!
The last bullet point about other work being pushed onto the queue being "fine" is only the case if that other work doesn't also use a barrier! If you have two different producer threads putting work items with a barrier on to the same queue and those items are interleaved, then some threads will wait on one barrier and others on the other one, and neither will ever reach the required wait count - DEADLOCK. One way to avoid this is to only ever use barriers like this from a single thread, or even to only ever use one barrier in your whole program (this sounds extreme but is actually quite a common strategy, as barriers are often used for one-time initialisation on startup). Another option, if the thread queue you're using supports it, is to atomically push all work items for the barrier onto the queue at once so they're never interleaved with any other work items. (This won't work with the moodycamel queue, which supports pushing multiple items at once but doesn't guarantee that they won't be interleaved with items pushed on by other threads.)
Barrier without completion function
At the point when you asked this question, the proposed experimental API didn't support completion functions. Even the current API at least allows not using them, so I thought I should show an example of how barriers can be used like that too.
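auto barrier = std::make_shared<std::barrier<>>(worker_count);
auto workerMainFn = [&procs, barrier](Proc&) {
    barrier->arrive_and_wait(); // Wait for the other threads to get here
    // Do something with the whole vector of Proc objects
    barrier->arrive_and_wait(); // Release the other threads
};
auto workerOtherFn = [barrier](Proc&) {
    barrier->arrive_and_wait(); // Wait for work to start
    barrier->arrive_and_wait(); // Wait for work to finish
};
queue.push_back(std::move(workerMainFn));
for (size_t i = 0; i < worker_count - 1; ++i) {
    queue.push_back(workerOtherFn);
}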
How this works:
The key idea is to wait for the barrier twice in each thread, and do the work in between. The first waits have the same purpose as the previous example: they ensure any earlier work items in the queue are finished before starting this work. The second waits ensure that any later items in the queue don't start until this work has finished.
The notes are mostly the same as the previous barrier example, but here are some differences:
One difference is that, because the barrier is not tied to the specific completion function, it's more likely that you can share it between multiple uses, like we did in the latch example, avoiding the use of a shared pointer.
This example makes it look like using a barrier without a completion function is much more fiddly, but that's just because this situation isn't well suited to them. Sometimes, all you need is to reach the barrier. For example, whereas we initialised a queue before the threads started, maybe you have a queue for each thread but initialised in the threads' run functions. In that case, maybe the barrier just signifies that the queues have been initialised and are ready for other threads to pass messages to each other. In that case, you can use a barrier with no completion function without needing to wait on it twice like this.
You could actually use a latch for this, calling count_down() and then wait() in place of arrive_and_wait(). But using a barrier makes more sense, both because calling the combined function is a little simpler and because using a barrier communicates your intention better to future readers of the code.
In any case, the "DANGER" warning from before still applies.

Mutex granularity

I have a question regarding threads. It is commonly said that when a thread locks a mutex, it keeps executing that part of the code uninterrupted by other threads until it unlocks the mutex (at least that's what the book says). So my question is whether it is actually possible to have several scoped WriteLocks which do not interfere with each other. For example something like this:
If I have a buffer with N elements, with no new elements coming in but with high-frequency updates (such as changing the value of the Kth element), is it possible to set a different lock on each element, so that the only time threads stall and wait is when 2 or more threads are actually trying to update the same element?
To answer your question about N mutexes: yes, that is indeed possible. What resources are protected by a mutex depends entirely on you as the user of that mutex.
This leads to the first (statement) part of your question. A mutex by itself does not guarantee that a thread will work uninterrupted. All it guarantees is MUTual EXclusion - if thread B attempts to lock a mutex which thread A has locked, thread B will block (execute no code) until thread A unlocks the mutex.
This means mutexes can be used to guarantee that a thread executes a block of code uninterrupted; but this works only if all threads follow the same mutex-locking protocol around that block of code. Which means it is your responsibility to assign semantics (or meaning) to each individual mutex, and correctly adhere to those semantics in your code.
If you decide for the semantics to be "I have an array a of N data elements and an array m of N mutexes, and accessing a[i] can only be done when m[i] is locked," then that's how it will work.
The need to consistently stick to the same protocol is why you should generally encapsulate the mutex and the code/data protected by it in a class in some way or another, so that outside code doesn't need to know the details of the protocol. It just knows "call this member function, and the synchronisation will happen automagically." This "automagic" will be the class correctly implementing the protocol.
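A minimal sketch of such an encapsulation for the "array a of N elements, array m of N mutexes" protocol, assuming int elements. LockedBuffer and its member names are made up for illustration; the only point is that callers never see the mutexes.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical class: N values, each guarded by its own mutex.
class LockedBuffer {
public:
    explicit LockedBuffer(std::size_t n) : values_(n, 0), locks_(n) {}

    // Applies fn to element i while holding only that element's mutex.
    template <typename Fn>
    void update(std::size_t i, Fn fn) {
        std::lock_guard<std::mutex> guard(locks_[i]);
        fn(values_[i]);
    }

    int read(std::size_t i) {
        std::lock_guard<std::mutex> guard(locks_[i]);
        return values_[i];
    }

private:
    std::vector<int> values_;
    std::vector<std::mutex> locks_;  // sized once in the constructor, never resized
};

int run_buffer_demo() {
    LockedBuffer buffer(4);
    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t) {
        threads.emplace_back([&buffer, t] {
            // Two threads share each element; threads on different elements never block each other.
            for (int k = 0; k < 1000; ++k)
                buffer.update(static_cast<std::size_t>(t % 4), [](int& v) { ++v; });
        });
    }
    for (auto& th : threads) th.join();
    return buffer.read(0) + buffer.read(1) + buffer.read(2) + buffer.read(3);
}
```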
A crucial consideration when deciding between a mutex per array and a mutex per element is whether there are operations - like tracking the number of "in-use" array elements, the "active" element, or moving a pointer-to-array to a larger buffer - that can only be done safely by one thread while all the others are blocked.
A lesser but sometimes important consideration is the amount of extra memory more mutexes use.
If you genuinely need to do this kind of update as quickly as possible in a highly contested multi-threaded program, you may also want to learn about lock-free atomic types and their compare-and-swap / exchange operations, but I'd recommend against considering that unless profiling shows that the existing locking is significant in your overall program performance.
A mutex does not stop other threads from running completely, it only stops other threads from locking the same mutex. I.e. while one thread is keeping the mutex locked, the operating system continues to do context switches letting other threads run also, but if any other thread is trying to lock the same mutex its execution will be halted until the mutex is unlocked.
So yes, you can indeed have several different mutexes and lock/unlock them independently. Just beware of deadlocks, i.e. if one thread can lock more than one mutex at a time you can run into a situation where thread 1 has locked mutex A and is trying to lock mutex B, but blocks because thread 2 already has mutex B locked and is trying to lock mutex A.
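One common way to sidestep the A/B deadlock described above, shown here only as a sketch, is to acquire both mutexes with std::lock (C++11), which uses a deadlock-avoidance algorithm. Account, transfer and run_transfer_demo are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct Account {  // hypothetical example type
    std::mutex m;
    int balance = 0;
};

// transfer(a, b) and transfer(b, a) can run concurrently without deadlocking,
// because std::lock acquires both mutexes as a unit.
void transfer(Account& from, Account& to, int amount) {
    if (&from == &to) return;  // locking the same std::mutex twice is undefined
    std::lock(from.m, to.m);
    std::lock_guard<std::mutex> g1(from.m, std::adopt_lock);  // take over ownership
    std::lock_guard<std::mutex> g2(to.m, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}

int run_transfer_demo() {
    Account a, b;
    a.balance = b.balance = 1000;
    std::thread t1([&] { for (int i = 0; i < 10000; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < 10000; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;  // money is conserved
}
```

With C++17 the three locking lines can be replaced by a single std::scoped_lock over both mutexes.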
It's not completely clear what your use case is:
the threads get a buffer assigned on which they have to work, or
the threads have some results and request a specific buffer to update.
In the first variant you need some assignment logic that assigns a buffer to a thread.
This logic has to be executed atomically, so the best approach is to protect the assignment logic with a mutex.
In the other variant it may be best to have a vector of mutexes, one for each buffer element.
In both cases the buffer itself does not need protection, because it (or rather each field of it) is only accessed by one thread at a time.
You may also want to read up on semaphores. These contain a counter that lets you manage resources which are limited in number but of which there is more than one. A mutex is a special case of a semaphore with n=1.
You can have a mutex per entry; a C++11 mutex can easily be turned into an adaptive spinlock (for example by spinning on try_lock() a bounded number of times before falling back to lock()), so you can achieve a good CPU/latency tradeoff.
Or, if you need very low latency yet have enough CPUs you can use an atomic "busy" flag per entry and spin in a tight compare-exchange loop on contention.
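A rough sketch of the per-entry "busy" flag idea, assuming C++11 atomics. Entry, update_entry and run_spin_demo are hypothetical names, and a production version would probably add a pause or yield inside the spin loop.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct Entry {  // hypothetical example type
    std::atomic<bool> busy{false};
    long value = 0;  // protected by the busy flag
};

void update_entry(Entry& e, long delta) {
    bool expected = false;
    // Spin until this thread is the one that flips busy from false to true.
    while (!e.busy.compare_exchange_weak(expected, true,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed)) {
        expected = false;  // a failed exchange overwrote it with the observed value
    }
    e.value += delta;
    e.busy.store(false, std::memory_order_release);  // publish the update
}

long run_spin_demo() {
    Entry entries[2];
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([&entries, t] {
            for (int k = 0; k < 10000; ++k) update_entry(entries[t % 2], 1);
        });
    for (auto& th : threads) th.join();
    return entries[0].value + entries[1].value;
}
```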
From experience, though, the best performance and scalability are achieved when concurrent writes are serialized via a command queue (or a queue of smaller immutable buffers to be concatenated at destination) and a single thread processing the queue.

Thread synchronization in qt

I have a program that has 3 threads. All of them take data from Ethernet on different ports. The rate at which data arrives may differ between the 3 threads, but all of the incoming data must be processed at the same time.
So if data arrives for one thread, it must wait for the others' data to arrive. How can I achieve this?
Boost.Thread has a barrier class, whose purpose is to block multiple threads until a specified number have reached the barrier.
You would just create a boost::barrier initialized with 3, meaning that it blocks until three threads are waiting on the barrier. When each of your threads is done waiting for data, you have them call wait() on the barrier. When the third thread calls wait(), all three threads will continue execution.
boost::barrier barrier(3);

void thread_function()
{
    barrier.wait(); // Threads will block here until all three are ready.
}
If you only want one thread to process the data, you can check the return value of wait(); the function will only return true for one of the threads at the barrier.
You need a barrier. A barrier has a preset capacity N and blocks the first N-1 threads until the N-th arrives. After the N-th arrives, all N threads are released simultaneously.
Unfortunately Qt has no direct support for barriers, but there is a simple implementation using Qt primitives here: https://stackoverflow.com/a/9639624/1854587
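To show how little is needed for such a barrier, here is a sketch built on std::mutex and std::condition_variable rather than Qt primitives (so it is not the linked implementation, just the same idea in standard C++). SimpleBarrier is a hypothetical name; like boost::barrier, its wait() returns true for exactly one thread per cycle.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

class SimpleBarrier {  // hypothetical name
public:
    explicit SimpleBarrier(std::size_t n) : threshold_(n) {}

    // Blocks until threshold_ threads have called wait(); reusable across cycles.
    bool wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::size_t gen = generation_;
        if (++waiting_ == threshold_) {
            waiting_ = 0;
            ++generation_;  // start a new cycle and release everyone
            released_.notify_all();
            return true;    // the last thread to arrive
        }
        released_.wait(lock, [&] { return gen != generation_; });
        return false;
    }

private:
    std::mutex mutex_;
    std::condition_variable released_;
    const std::size_t threshold_;
    std::size_t waiting_ = 0;
    std::size_t generation_ = 0;
};

int run_barrier_demo() {
    SimpleBarrier barrier(3);
    std::atomic<int> leaders{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < 3; ++t)
        threads.emplace_back([&] {
            for (int round = 0; round < 5; ++round)
                if (barrier.wait()) ++leaders;  // exactly one thread per round
        });
    for (auto& th : threads) th.join();
    return leaders.load();
}
```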
Not as simple as boost's barrier as answered by @dauphic, but this can be done with Qt alone, using slots, signals and another class on a 4th thread.
Create a class on a separate thread that coordinates the other 3. The network threads can send a signal to this 'coordinator' class when they receive data. Once the coordinator class has received messages from all 3 network threads, it can then signal the threads to process the data.

Idiom or pattern for N concurrent readers and 1 producer

Using the C++11 standard library (with only the help of boost::thread if necessary), is there a clean way to implement an N readers - 1 producer solution, where all the readers, once notified at the same time by the producer (with std::condition_variable::notify_all() for example), are guaranteed to enter their critical section before the producer enters its critical section a second time? In other words, all the notified readers must observe the same state of the shared resource. Once the producer notifies the N readers, it cannot modify the shared resource until all N readers have finished reading. Note that boost::barrier is not really what I need, as I do not know N in advance: N may vary from one notification to another.
You could use atomic counters, with some polling from the producer thread.
When the counter reaches either N or 0 (it's up to you), the producer gets to work and produces whatever it needs to produce. Before notifying the condition variable, the producer sets the counter to 0 (or N).
When a reader is done, it simply increases (or decreases) the counter.
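A sketch of how this could look, using the count-down variant and C++11 only. All names are hypothetical, and I am assuming the producer knows how many readers it is notifying on each round (reader_count may differ from call to call); the generation number is my own addition so that readers can tell a new notification from a spurious wakeup.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

struct Shared {  // hypothetical names throughout
    std::mutex m;
    std::condition_variable cv;
    int generation = 0;           // bumped by the producer on each publish
    int value = 0;                // the shared resource
    std::atomic<int> pending{0};  // readers that have not yet read this generation
};

void publish(Shared& s, int new_value, int reader_count) {
    while (s.pending.load() != 0) std::this_thread::yield();  // poll for the readers
    {
        std::lock_guard<std::mutex> lock(s.m);
        s.value = new_value;
        ++s.generation;
        s.pending.store(reader_count);  // reset the counter before notifying
    }
    s.cv.notify_all();
}

// Waits for the generation after last_seen and returns the value observed.
int read_next(Shared& s, int last_seen) {
    int seen;
    {
        std::unique_lock<std::mutex> lock(s.m);
        s.cv.wait(lock, [&] { return s.generation != last_seen; });
        seen = s.value;
    }
    s.pending.fetch_sub(1);  // tell the producer this reader is done
    return seen;
}

bool run_broadcast_demo() {
    const int readers = 3, rounds = 5;
    Shared s;
    std::atomic<int> mismatches{0};
    std::vector<std::thread> threads;
    for (int r = 0; r < readers; ++r)
        threads.emplace_back([&] {
            for (int gen = 0; gen < rounds; ++gen)
                if (read_next(s, gen) != (gen + 1) * 10) ++mismatches;
        });
    for (int gen = 1; gen <= rounds; ++gen) publish(s, gen * 10, readers);
    for (auto& t : threads) t.join();
    return mismatches.load() == 0;  // every reader saw every state exactly once
}
```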
What you describe is called a barrier

Keep Track of Reference to Data ( How Many / Who ) in Multithreading

I came across a problem in multithreading. The model is 1 producer and N consumers.
The producer produces the data (character data, around 200 bytes each) and puts it in a fixed-size cache (say, 2 million entries). The data is not relevant to all the threads: the producer applies a (configured) filter and determines the number of threads that qualify for the produced data.
The producer pushes a pointer to the data into the queue of each qualifying thread (only the pointer, to avoid copying the data). The threads dequeue it and send it over TCP/IP to their clients.
Problem: because only a pointer to the data is given to multiple threads, when the cache becomes full and the producer wants to delete the first (oldest) item, some thread may still be referring to that data.
Feasible way: use atomic granularity. When the producer determines the number of qualifying threads, it can update the counter and the list of thread ids.
class InUseCounter
{
    int m_count;
    set<thread_t> m_in_use_threads;
    Mutex m_mutex;
    Condition m_cond;
public:
    // This constructor is used by the producer
    InUseCounter(int count, set<thread_t> tlist)
    {
        m_count = count;
        m_in_use_threads = tlist;
    }
    // This function is called by each thread when it is done with the data,
    // signalling that it no longer uses the reference to the data.
    void decrement(thread_t tid)
    {
        Guard<Mutex> lock(m_mutex);
        --m_count;
        m_in_use_threads.erase(tid);
    }
    int get_count() const { return m_count; }
};
Master cache:
map<seqnum, Data>
pair<CharData, InUseCounter>
When the producer removes an element it checks the counter; if it is more than 0, it sends an action to release the reference to the threads in the m_in_use_threads set.
If there are 2 million records in the master cache, there will be an equal number of InUseCounter objects, and so of Mutex variables. Is it advisable to have 2 million mutex variables in one single process?
Having one big data structure to maintain the InUseCounters will cause more locking time to find and decrement a counter.
What would be the best alternative to my approach for finding out how many references exist, and who holds them, with very little locking time?
Thanks in advance for your advice.
2 million mutexes is a bit much. Even if they are lightweight locks, they still take up some overhead.
Putting the InUseCounter in a single structure would end up involving contention between threads when they release a record; if the threads do not execute in lockstep, this might be negligible. If they are frequently releasing records and the contention rate goes up, this is obviously a performance sink.
You can improve performance by having one thread responsible for maintaining the record reference counts (the producer thread) and having the other threads send back record release events over a separate queue, in effect, turning the producer into a record release event consumer. When you need to flush an entry, process all the release queues first, then run your release logic. You will have some latency to deal with, as you are now queueing up release events instead of attempting to process them immediately, but the performance should be much better.
Incidentally, this is similar to how the Disruptor framework works. It's a high performance Java(!) concurrency framework for high frequency trading. Yes, I did say high performance Java and concurrency in the same sentence. There is a lot of valuable insight into high performance concurrency design and implementation.
Since you already have a Producer->Consumer queue, one very simple system consists in having a "feedback" queue (Consumer->Producer).
After having consumed an item, the consumer feeds the pointer back to the Producer so that the Producer can remove the item and update the "free-list" of the cache.
This way, only the Producer ever touches the cache innards, and no synchronization is necessary there: only the queues need be synchronized.
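Roughly, the feedback-queue idea could be sketched like this. SeqQueue and Cache are hypothetical names, sequence numbers stand in for the data pointers, and the demo skips the forward Producer->Consumer queue entirely: consumers simply report every record as done. The cache has no lock of its own because only the producer thread touches it.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Mutex-protected queue of sequence numbers (hypothetical helper).
class SeqQueue {
public:
    void push(int seq) {
        std::lock_guard<std::mutex> lock(m_);
        items_.push_back(seq);
    }
    bool try_pop(int& seq) {
        std::lock_guard<std::mutex> lock(m_);
        if (items_.empty()) return false;
        seq = items_.front();
        items_.pop_front();
        return true;
    }
private:
    std::mutex m_;
    std::deque<int> items_;
};

// Only the producer thread touches Cache, so it needs no synchronisation.
struct Cache {
    std::map<int, int> in_use;  // seqnum -> outstanding references

    // Apply every queued release event; records that hit zero become free.
    void drain(SeqQueue& feedback) {
        int seq;
        while (feedback.try_pop(seq))
            if (--in_use[seq] == 0) in_use.erase(seq);
    }
};

std::size_t run_feedback_demo() {
    const int consumers = 3, records = 100;
    Cache cache;
    SeqQueue feedback;
    for (int seq = 0; seq < records; ++seq) cache.in_use[seq] = consumers;

    std::vector<std::thread> threads;
    for (int c = 0; c < consumers; ++c)
        threads.emplace_back([&] {
            for (int seq = 0; seq < records; ++seq) feedback.push(seq);  // "done with seq"
        });
    for (auto& t : threads) t.join();

    cache.drain(feedback);        // producer processes the release events
    return cache.in_use.size();   // records still referenced
}
```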
Yes, 2,000,000 mutexes are overkill.
One big structure will be locked for longer, but will require far fewer lock/unlock operations.
The best approach would be to use shared_ptr smart pointers: they seem to be tailor-made for this. You don't check the counter yourself, you just release your pointer. The reference count of a shared_ptr is thread-safe; the data it points to is not, but for 1 producer (writer) / N consumers (readers), this should not be an issue.
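A small sketch of the shared_ptr idea, assuming C++14 for the init-capture. The record type, the three consumers and the byte counter standing in for the TCP send are all made up for the example; the weak_ptr is only there to observe when the data is actually freed.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <memory>
#include <string>
#include <thread>
#include <vector>

bool run_shared_ptr_demo() {
    using Record = std::shared_ptr<const std::string>;
    std::deque<Record> cache{std::make_shared<const std::string>("payload")};
    std::weak_ptr<const std::string> watcher = cache.front();  // observes lifetime only

    std::atomic<int> bytes_sent{0};
    std::vector<std::thread> consumers;
    for (int i = 0; i < 3; ++i)
        // Each consumer gets its own copy of the pointer, not of the data.
        consumers.emplace_back([rec = cache.front(), &bytes_sent]() mutable {
            bytes_sent += static_cast<int>(rec->size());  // stands in for the TCP send
            rec.reset();                                  // consumer is done with the data
        });

    cache.pop_front();  // producer evicts; the data lives on while any consumer holds it
    for (auto& t : consumers) t.join();

    // Freed exactly when the last holder let go, with no explicit counter checks.
    return watcher.expired() && bytes_sent.load() == 21;
}
```

Note that this tells you when the last reference goes away but not which threads still hold one; if you need the "who" as well, you would have to track that separately.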