Multi consumer dequeue in DPDK framework - dpdk

Is it possible to implement a ring buffer in DPDK in which a single object enqueued by a single producer can be dequeued by multiple consumers (say, 4 consumers)? In other words, is an object dequeued by the 1st consumer from the ring buffer still available to the other consumers?

DPDK rings hold just pointers to buffers, and the enqueue/dequeue operations are quite cheap. So the best solution that comes to mind is to create 4 rings and enqueue the same object 4 times, once to each ring.
There might be issues with freeing this object; have a look at the mbuf reference counter.
And there might be issues with simultaneous modification of the object; have a look at locks or other synchronization mechanisms.
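A minimal sketch of that approach, assuming the object is an rte_mbuf and the four rings have already been created with rte_ring_create; the fan_out helper and its error handling are illustrative only, not part of DPDK:

#include <rte_mbuf.h>
#include <rte_ring.h>

static int fan_out(struct rte_mbuf *m, struct rte_ring *rings[4])
{
    /* The producer's original reference counts as one; add three more so the
     * total equals the number of consumer rings (4). */
    rte_mbuf_refcnt_update(m, 3);

    for (int i = 0; i < 4; i++) {
        if (rte_ring_enqueue(rings[i], m) != 0) {
            /* Ring full: drop the references reserved for the rings we did
             * not reach (including this one). The mbuf is returned to its
             * mempool only when the last reference is freed. */
            for (int j = i; j < 4; j++)
                rte_pktmbuf_free(m);
            return -1;
        }
    }
    return 0;
}

Each consumer dequeues its copy of the pointer, treats the data as read-only (or takes a lock before modifying it), and calls rte_pktmbuf_free(m) when done; the mbuf is actually freed only after the last consumer drops its reference.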

Related

Make datastream from thread readable for all other threads

I have to distribute a data stream among the clients of a multithreaded server instance; the client threads only need to read. That means there is one thread from which the data comes, and all other threads need to read that data (they do not have to change it) so that they can send it to the clients.
I tried a thread-safe queue (https://blog.chrisd.info/a-simple-thread-safe-queue-for-use-in-multi-threaded-c-applications/), but as soon as I tried it with more than one client, only the second (the newest) one received the data.
How do I solve this problem? Are there any thread-safe queues that can be used across multiple threads?
From what you described, the usual queue semantics won't work, since you actually want to pop an element only when all the threads have gotten it, not on first access. So you have several options:
Maintain a queue per client thread, and have the producer thread push the data into each of the client queues. By wrapping the data in an std::shared_ptr you reduce memory overhead and get semantics where the data is destroyed when the last client is done with it (see the sketch after this list).
Have a single queue but a separate tail pointer for each thread. This can get complex in terms of handling the threads as they spawn/terminate, and you haven't stated what the constraints of your system are - is the thread count fixed or dynamic?
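A minimal sketch of the first option, assuming a fixed set of client threads; Data, ClientQueue and broadcast are illustrative names, not part of any library:

#include <condition_variable>
#include <deque>
#include <memory>
#include <mutex>
#include <vector>

struct Data { /* the stream payload */ };

class ClientQueue {
public:
    void push(std::shared_ptr<const Data> item) {
        { std::lock_guard<std::mutex> lk(m_); q_.push_back(std::move(item)); }
        cv_.notify_one();
    }
    std::shared_ptr<const Data> pop() {      // blocks until data is available
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        auto item = std::move(q_.front());
        q_.pop_front();
        return item;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::shared_ptr<const Data>> q_;
};

// Producer side: one push per client queue; every queue shares the same object,
// which is destroyed when the last client releases its shared_ptr.
void broadcast(std::vector<std::unique_ptr<ClientQueue>>& queues,
               std::shared_ptr<const Data> d) {
    for (auto& q : queues) q->push(d);
}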

Multirate threads

I recently ran into a requirement for a multithreaded application whose threads run at different rates.
The questions then become (since I am still learning multithreading):
A scenario is given to put things into perspective:
Say 1st thread runs at 100 Hz "real time"
2nd runs at 10 Hz
and say that the 1st thread provides data "myData" to the 2nd thread.
How is myData going to be provided to the 2nd thread? Is the common practice to just read whatever is available from the first thread, or does there need to be some kind of decimation to reduce the rate?
Does myData need to be some kind of Singleton with a locking mechanism? Note that myData isn't really shared, but rather updated by the first thread and used in the second thread.
How about the opposite case, when the data used in one thread needs to be used at a higher rate in a different thread?
How is myData going to be provided to the 2nd thread
One common method is to provide a FIFO queue -- this could be a std::deque or a linked list, or whatever -- and have the producer thread push data items onto one end of the queue while the consumer thread pops the data items off of the other end. Be sure to serialize all accesses to the FIFO queue (using a mutex or similar locking mechanism) to avoid race conditions.
Alternatively, instead of a queue you could have a single shared data object (essentially a queue of length one) and have your producer thread overwrite the object every time it generates new data. This could be done in cases where it's not important that the consumer thread sees every piece of data that was generated, but rather it's only important that it sees the most recent data. You'd still need to do the locking, though, to avoid the risk of the consumer thread reading from the data object at the same time the producer thread is in the middle of writing to it.
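A minimal sketch of that single-slot variant, assuming a 100 Hz producer and a 10 Hz consumer as in the question; Sample and LatestValue are illustrative names:

#include <mutex>
#include <optional>

struct Sample { double value; };

class LatestValue {
public:
    void publish(const Sample& s) {          // called by the 100 Hz producer
        std::lock_guard<std::mutex> lk(m_);
        latest_ = s;                         // overwrite, never queue
    }
    std::optional<Sample> read() const {     // called by the 10 Hz consumer
        std::lock_guard<std::mutex> lk(m_);
        return latest_;                      // empty until the first publish
    }
private:
    mutable std::mutex m_;
    std::optional<Sample> latest_;
};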
or does there need to be some kind of decimation to reduce the rate.
There doesn't need to be any decimation -- the second thread can just read in as much data as there is available to read, whenever it wakes up.
Does myData need to be some kind of Singleton with a locking mechanism?
Singleton isn't necessary (although it's possible to do it that way). The locking mechanism is necessary, unless you have some kind of lock-free synchronization mechanism (and if you're asking this level of question, you don't have one and you don't want to try to get one either -- keep things simple for now!)
How about the opposite case, when the data used in one thread needs to be used at a higher rate in a different thread?
It's the same -- if you're using a proper inter-thread communications mechanism, the rates at which the threads wake up don't matter, because the communications mechanism will do the right thing regardless of when or how often the threads wake up.
Any multithreaded program has to cope with the possibility that one of the threads will work faster than another - by any ratio - even if they're executing on the same CPU with the same clock frequency.
Your choices include:
a producer-consumer container that lets the first thread enqueue data and the second thread "pop" it off for processing: you could let the queue grow as large as memory allows, or put some limit on the size, after which either data would be lost or the 1st thread would be forced to slow down and wait to enqueue further values (see the sketch after this list)
there are libraries available (e.g. boost), or if you want to implement it yourself google some tutorials/docs on mutex and condition variables
do something conceptually similar to the above but where the size limit is 1 so there's just the single myData variable rather than a "container" - but all the synchronisation and delay choices remain the same
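A minimal sketch of the first choice with a fixed size limit, built on std::mutex and std::condition_variable; the BoundedQueue name and its capacity parameter are illustrative:

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(T item) {                      // producer: waits while full
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [this] { return q_.size() < capacity_; });
        q_.push_back(std::move(item));
        not_empty_.notify_one();
    }

    T pop() {                                // consumer: waits while empty
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [this] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop_front();
        not_full_.notify_one();
        return item;
    }

private:
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::deque<T> q_;
    const std::size_t capacity_;
};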
The Singleton pattern is orthogonal to your needs here: the two threads do need to know where the data is, but that would normally be done using e.g. a pointer argument to the function(s) run in the threads. Singleton's easily overused and best avoided unless reasons stack up high....

In an LMAX disruptor like pattern, how do you handle a slow consumer?

I have a question about what to do in the case of a slow consumer in an LMAX-disruptor-like ring buffer that has multiple producers and a single consumer, running on x86 Linux. With an LMAX-like ring buffer pattern you are constantly overwriting data, but what if the consumer is slow? How do you handle the case where, say, in a 10-slot ring buffer (slots 0-9) your consumer is on slot 5 and your writers are now ready to start writing sequence 15, which is also slot 5 in the buffer (i.e. slot 5 = 15 % 10)? What is the typical way to handle this so that writers still produce data in the order it came in and clients receive the data in that same order? That's really my question. Below are some details about my design; it works fine, it's just that I currently don't have a good way to handle this issue. There are multiple threads doing writes and a single thread doing reads. I can't introduce multiple reader threads without changing the existing design, which is beyond the current project scope, but I am still interested in your thoughts if they involve this as a solution.
Design specifics
I have a ring buffer, and the design currently has multiple producer threads and a single consumer thread. This part of the design already exists and cannot currently change. I am trying to replace the existing queueing system with a lock-free ring buffer. What I have is as follows.
The code runs on x86 Linux; there are multiple writer threads and a single reader thread. The reader and writer sequences are std::atomic<uint64_t> and start one slot apart: the reader starts at slot 0 and the writer at slot 1. Each writer first claims a slot by doing an atomic fetch_add(1, std::memory_order_acq_rel) on the writer sequence (by calling incrementSequence, shown below) and afterwards uses a compare-and-swap loop to update the reader sequence to let clients know this slot is available (see updateSequence).
inline data_type incrementSequence() {
    return m_sequence.fetch_add(1, std::memory_order_seq_cst);
}

void updateSequence(data_type aOld, data_type aNew) {
    // Spin until the published sequence reaches (or passes) aNew.
    while (!m_sequence.compare_exchange_weak(aOld, aNew,
                                             std::memory_order_release,
                                             std::memory_order_relaxed)) {
        if (sequence() < aNew) {
            continue;
        }
        break;
    }
}

inline data_type sequence() const {
    return m_sequence.load(std::memory_order_acquire);
}
A ring buffer (or a FIFO in general--doesn't have to be implemented as a ring buffer) is intended to smooth out bursts of traffic. Even though producers may produce the data in bursts, the consumers can deal with a steady flow of input.
If you're overflowing the FIFO, it means one of two things:
Your bursts are larger than you planned for. Fix this by increasing the FIFO size (or making its size dynamic).
Your producers are out-running your consumers. Fix this by increasing the resources devoted to consuming the data (probably more threads).
It sounds to me like you're currently hitting the second: your single consumer simply isn't fast enough to keep up with the producers. The only real choices in that case are to speed up consumption by either optimizing the single consumer, or adding more consumers.
It also sounds a bit as if your consumer may be leaving their input data in the FIFO while they process the data, so that spot in the FIFO remains occupied until the consumer finishes processing that input. If so, you may be able to fix your problem by simply having the consumer remove the input data from the FIFO as soon as it starts processing. This frees up that slot so the producers can continue placing input into the buffer.
One other point: making the FIFO size dynamic can be something of a problem. The problem is fairly simple: it can cover up the fact that you really have the second problem of not having the resources necessary to process the data on the consumer side.
Assuming both the producers and the consumers are thread pools, the easiest way to balance the system is often to use a fixed-size FIFO. If the producers start to get so far ahead of the consumers that the FIFO overflows, then producers start to block. This lets the consumer thread pool consume more computational resources (e.g., run on more cores) to catch back up with the producers. This does, however, depend on being able to add more consumers, not restricting the system to a single consumer.
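As a rough sketch of that "producers block" idea applied to the ring in the question: assume the single reader publishes a separate "consumed up to" counter (m_consumedSeq below, which is an assumption and not part of the original design); each writer then claims a sequence but waits until its slot can no longer overwrite unread data:

#include <atomic>
#include <cstdint>
#include <thread>

constexpr uint64_t RING_SIZE = 1024;

std::atomic<uint64_t> m_writeSeq{1};     // next sequence for writers to claim
std::atomic<uint64_t> m_consumedSeq{0};  // last sequence fully processed by the reader

uint64_t claimSlot() {
    // Claim the next sequence number for this writer.
    uint64_t seq = m_writeSeq.fetch_add(1, std::memory_order_acq_rel);
    // Block (here: spin/yield) until the reader is within one ring's worth of
    // the claimed slot, so the write cannot clobber unconsumed data.
    while (seq - m_consumedSeq.load(std::memory_order_acquire) >= RING_SIZE) {
        std::this_thread::yield();
    }
    return seq;                          // the slot index is seq % RING_SIZE
}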

Dealing with boost threads race conditions in C++

I have 6 threads running in my application continuously. The scenario is:
One thread continuously gets messages and inserts them into a message queue. The other 4 threads can be considered workers which continuously fetch messages from the queue and process them. The final thread populates the analytics information.
Problem:
Now, the sleep duration for the message-fetching thread is 100 ms; for the worker threads it is 200 ms. When I run this application, the message-fetching thread takes control and keeps inserting into the queue, so the heap grows. The worker threads don't get a chance to process the messages and deallocate them. Eventually this results in out-of-memory.
How can I manage this kind of scenario so that the message-fetching thread and the worker threads get equal opportunity?
Thanks in advance :)
You need to add back-pressure to your producer thread. Usually this is done by using blocking producer-consumer queues. The producer adds items to the queue, and consumers dequeue items from the queue and process them. If the queue is empty, consumers block until the producer adds something; if the queue is full, the producer blocks until consumers fetch items from the queue.
One system of flow control that I use often is to create a large pool of message objects at startup and never create any more. The objects are stored on a thread-safe, blocking 'pool queue' and circulated around: popped from the pool by producer(s), queued to consumer(s) on other blocking queues, and then pushed back onto the pool queue when 'consumed'.
This caps memory use, provides flow control (if the pool empties, the producer(s) block on it until messages are returned by consumers), and eliminates continual new/delete/malloc/free. The more complex and slower bounded queues are not necessary, and all queues need only be large enough to hold the (known) maximum number of messages.
Using 'classic' blocking queues does not require any Sleep() calls.
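A minimal sketch of that pool-queue scheme; BlockingQueue, Message and the two global queues are illustrative names, and the pool is assumed to be filled with a fixed number of Message objects at startup:

#include <condition_variable>
#include <deque>
#include <mutex>

struct Message { char data[256]; };      // fixed-size message, allocated once

class BlockingQueue {
public:
    void push(Message* m) {
        { std::lock_guard<std::mutex> lk(m_); q_.push_back(m); }
        cv_.notify_one();
    }
    Message* pop() {                     // blocks while the queue is empty
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        Message* m = q_.front();
        q_.pop_front();
        return m;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<Message*> q_;
};

BlockingQueue pool;   // holds idle messages; push N objects into it at startup
BlockingQueue work;   // producer -> consumers

void producerStep() {
    Message* m = pool.pop();             // blocks if the pool is empty: flow control
    /* ... fill *m with a received message ... */
    work.push(m);
}

void consumerStep() {
    Message* m = work.pop();
    /* ... process *m ... */
    pool.push(m);                        // recycle; no delete anywhere
}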
Your question is a little vague, so I can give you these guidelines instead of code:
Protect shared data with a mutex. In a multithreaded producer-consumer problem there is usually a race condition on the shared data (the message in your program): one thread attempts to write to the shared memory location while the other tries to read from the same location. The message read by the reader might be corrupted because the writer wrote over it in the middle of the read. You can guard the shared memory location with a mutex; each thread should acquire this lock in order to read or modify the shared data. This way the consumer can be sure the data has not been modified mid-read. Note, however, that holding this lock can hold back the producer thread, so you should release it as soon as possible.
Use condition variables to notify consumer threads. If you do not use a notification mechanism, all consumer threads have to actively poll for produced data, which wastes system resources. With one, the consumer threads can simply sleep, knowing that the producer thread will notify them whenever a message is ready.
The threading library in C++11 has everything you need to implement a consumer-producer application. However, if you are not able to upgrade your compiler, you could use the boost threading library as well.
You want to use a bounded queue which, when full, will block threads trying to enqueue until more space is available.
You can use concurrent_bounded_queue from TBB, or simply use a semaphore initialized to the maximum queue size, decrementing it on enqueue and incrementing it on dequeue. boost::thread doesn't provide semaphores natively, but you can implement one using locks and condition variables, as sketched below.
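For example, a counting semaphore can be built from std::mutex and std::condition_variable (boost::mutex and boost::condition_variable work the same way): initialise the count to the maximum queue size, call wait() before each enqueue and post() after each dequeue. The class and function names here are illustrative:

#include <condition_variable>
#include <mutex>

class CountingSemaphore {
public:
    explicit CountingSemaphore(int initial) : count_(initial) {}

    void wait() {                        // producer: blocks while the queue is full
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return count_ > 0; });
        --count_;
    }

    void post() {                        // consumer: signals a freed slot
        { std::lock_guard<std::mutex> lk(m_); ++count_; }
        cv_.notify_one();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_;
};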

what's the advantage of message queue over shared data in thread communication?

I read an article about multithreaded program design, http://drdobbs.com/architecture-and-design/215900465. It says that a best practice is "replacing shared data with asynchronous messages. As much as possible, prefer to keep each thread’s data isolated (unshared), and let threads instead communicate via asynchronous messages that pass copies of data".
What confuses me is that I don't see the difference between using shared data and using message queues. I am currently working on a non-GUI project on Windows, so let's use Windows message queues, and take the traditional producer-consumer problem as an example.
Using shared data, there would be a shared container and a lock guarding the container between the producer thread and the consumer thread. When the producer outputs something, it first waits for the lock, then writes to the container, then releases the lock.
Using a message queue, the producer can simply PostThreadMessage without blocking, and this is the async message's advantage. But I think there must be some lock guarding the message queue between the two threads, otherwise the data would definitely get corrupted; the PostThreadMessage call just hides the details. I don't know whether my guess is right, but if it is, the advantage seems to no longer exist, since both methods do the same thing and the only difference is that the system hides the details when message queues are used.
P.S. Maybe the message queue uses a non-blocking container, but I could use a concurrent container in the former approach too. I want to know how the message queue is implemented and whether there is any performance difference between the two approaches.
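For reference, the pattern described above looks roughly like this using the Win32 calls the question mentions; the WM_APP-based message id and the Payload struct are illustrative, and the consumer must create its message queue (e.g. with PeekMessage) before the producer posts anything:

#include <windows.h>

struct Payload { int value; };
const UINT WM_MY_DATA = WM_APP + 1;      // illustrative message id

DWORD WINAPI consumerThread(LPVOID) {
    MSG msg;
    // Force creation of this thread's message queue before anyone posts to it.
    PeekMessageW(&msg, nullptr, WM_USER, WM_USER, PM_NOREMOVE);
    while (GetMessageW(&msg, nullptr, 0, 0) > 0) {
        if (msg.message == WM_MY_DATA) {
            Payload* p = reinterpret_cast<Payload*>(msg.lParam);
            /* ... use *p ... */
            delete p;                    // the consumer owns the posted copy
        }
    }
    return 0;
}

// Producer side: pass a heap-allocated copy of the data, no shared state to lock.
void produce(DWORD consumerThreadId, int value) {
    Payload* p = new Payload{ value };
    if (!PostThreadMessageW(consumerThreadId, WM_MY_DATA, 0,
                            reinterpret_cast<LPARAM>(p))) {
        delete p;                        // post failed; avoid leaking the copy
    }
}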
updated:
I still don't get the concept of async message if the message queue operations are still blocked somewhere else. Correct me if my guess is wrong: when we use shared containers and locks, we block in our own thread; but when using message queues, my own thread returns immediately and the blocking work is left to some system thread.
Message passing is useful for exchanging smaller amounts of data, because no conflicts need be avoided. It is also much easier to implement than shared memory for intercomputer communication. And, as you've already noticed, message passing has the advantage that application developers don't need to worry about the details of protections the way they do with shared memory.
Shared memory allows maximum speed and convenience of communication, as it can be done at memory speed within a computer. Shared memory is usually faster than message passing, as message passing is typically implemented using system calls and thus requires the more time-consuming task of kernel intervention. In contrast, in shared-memory systems, system calls are required only to establish the shared-memory regions. Once established, all accesses are treated as normal memory accesses without extra assistance from the kernel.
Edit: One case in which you might want to implement your own queue is when there are lots of messages to be produced and consumed, e.g., a logging system. With the implementation of PostThreadMessage, the queue capacity is fixed; messages will most likely get lost if that capacity is exceeded.
Imagine you have 1 thread producing data,and 4 threads processing that data (presumably to make use of a multi core machine). If you have a big global pool of data you are likely to have to lock it when any of the threads needs access, potentially blocking 3 other threads. As you add more processing threads you increase the chance of a lock having to wait and increase how many things might have to wait. Eventually adding more threads achieves nothing because all you do is spend more time blocking.
If instead you have one thread sending messages into message queues, one for each consumer thread, then they can't block each other. You still have to lock the queue between the producer and consumer threads, but as you have a separate queue for each thread you have a separate lock, and no thread can block all the others while waiting for data.
If you suddenly get a 32 core machine you can add 20 more processing threads (and queues) and expect that performance will scale fairly linearly unlike the first case where the new threads will just run into each other all the time.
I have used a shared memory model where the pointers to the shared memory are managed in a message queue with careful locking. In a sense, this is a hybrid between a message queue and shared memory. It is very useful when large quantities of data must be passed between threads while retaining the safety of the message queue.
The entire queue can be packaged in a single C++ class with appropriate locking and the like. The key is that the queue owns the shared storage and takes care of the locking. Producers acquire a lock for input to the queue and receive a pointer to the next available storage chunk (usually an object of some sort), populate it, and release it. The consumer blocks until the next shared object has been released by the producer. It can then acquire a lock to the storage, process the data, and release it back to the pool. A suitably designed queue can perform multiple-producer/multiple-consumer operations with great efficiency. Think Java's thread-safe java.util.concurrent.BlockingQueue semantics, but for pointers to storage.
Of course there is "shared data" when you pass messages. After all, the message itself is some sort of data. However, the important distinction is that when you pass a message, the consumer receives a copy.
the PostThreadMessage call just hides the details
Yes, it does, but being a WINAPI call, you can be reasonably sure that it does it right.
I still don't get the concept of async message if the message queue operations are still blocked somewhere else.
The advantage is more safety. You have a locking mechanism that is systematically enforced when you are passing a message. You don't even need to think about it, you can't forget to lock. Given that multi-thread bugs are some of the nastiest ones (think of race conditions), this is very important. Message passing is a higher level of abstraction built on locks.
The disadvantage is that passing large amounts of data would probably be slow. In that case, you need to use shared memory.
For passing state (i.e. worker thread reporting progress to the GUI) the messages are the way to go.
It's quite simple (I'm amazed others wrote such length responses!):
Using a message queue system instead of 'raw' shared data means that you have to get the synchronization (locking/unlocking of resources) right only once, in a central place.
With a message-based system, you can think in higher terms of "messages" without having to worry about synchronization issues anymore. For what it's worth, it's perfectly possible that a message queue is implemented using shared data internally.
I think this is the key piece of info there: "As much as possible, prefer to keep each thread’s data isolated (unshared), and let threads instead communicate via asynchronous messages that pass copies of data". I.e. use producer-consumer :)
You can do your own message passing or use something provided by the OS. That's an implementation detail (needs to be done right ofc). The key is to avoid shared data, as in having the same region of memory modified by multiple threads. This can cause hard to find bugs, and even if the code is perfect it will eat performance because of all the locking.
I had exactly the same question. After reading the answers, I feel:
In the most typical use cases, queue = async, shared memory (locks) = sync. Indeed, you can build an async version on top of shared memory, but that's more code, similar to reinventing the message-passing wheel.
Less code = fewer bugs and more time to focus on other stuff.
The pros and cons are already mentioned by previous answers so I will not repeat.