Increasing concurrency in producer-consumer in C++

I am solving a specific kind of producer-consumer problem, which goes as follows:
There is a buffer of size n. Consumers take items from the buffer, one item at a time (by this I mean that whenever a consumer thread has access to the buffer, it takes no more than one item). Whenever the buffer is empty, the producer must be signalled. The producer completely fills the buffer and then blocks itself until it is signalled again. I have modelled the producer and each consumer as a thread and implemented it this way:
bool buffer[n];

// Producer
while (true) {
    lock(bufferLock);
    wait(producerSemaphore, bufferLock);
    completelyFillBuffer();
    signalAll(consumerSemaphore);
    unlock(bufferLock);
}

// Consumer
while (true) {
    lock(bufferLock);
    if (buffer.isEmpty()) {
        signal(producerSemaphore);
        wait(consumerSemaphore, bufferLock);
    }
    takeItemFromBuffer();
    unlock(bufferLock);
}

takeItemFromBuffer() {
    take any true item and make it false;
}

completelyFillBuffer() {
    make all items true;
}
The problem is that I am using a single lock for the complete buffer, so at any point only a single consumer can take an item. But when the buffer is large, it makes sense to allow more consumers to take items simultaneously. How do I implement this?

I think you can safely remove items from the buffer, or rather safely mark them false; just make that operation atomic. For instance:

// consumer
getTheItem(buffer);
if (item != false)
    checkAndChange(item);

checkAndChange(item):
    if (item != false)
        atomicChange(item)

And you can lock the buffer for the producer only. Another option is to use lock-free structures.
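For illustration, here is a minimal sketch of that idea, assuming the buffer is an array of std::atomic<bool> flags; tryTakeItem is a name invented for this sketch:

#include <atomic>
#include <cstddef>

constexpr std::size_t N = 1024;
std::atomic<bool> buffer[N];   // true = slot holds an item

// Consumer side: scan for a true slot and try to flip it to false
// atomically. compare_exchange_strong guarantees that only one consumer
// wins each slot, so no lock is needed on this path.
bool tryTakeItem(std::size_t& outIndex) {
    for (std::size_t i = 0; i < N; ++i) {
        bool expected = true;
        if (buffer[i].compare_exchange_strong(expected, false)) {
            outIndex = i;   // we own slot i now
            return true;
        }
    }
    return false;           // looked empty; time to signal the producer
}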

You can't; if you do that, how will you know that multiple consumers have not taken the same element? You do need a single lock for the single-producer buffer in order to safely remove one item at a time (serially). So you can't parallelize the fetching of items from that queue/buffer, but you can parallelize the processing of the values.
// consumer
while (true) {
    item = fetchItemFromBuffer();
    process(item);
}

fetchItemFromBuffer() {
    lock(bufferLock);
    while (buffer.isEmpty()) {
        signal(producerSemaphore);
        wait(consumerSemaphore, bufferLock);
    }
    item = buffer.remove(0);
    unlock(bufferLock);
    return item;
}
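For concreteness, here is a sketch of this pattern in actual C++ with std::mutex and std::condition_variable (the element type and fill size are placeholders chosen for the example):

#include <condition_variable>
#include <mutex>
#include <queue>

std::queue<int> buffer;
std::mutex bufferLock;
std::condition_variable consumerCv;  // consumers sleep here when empty
std::condition_variable producerCv;  // producer sleeps here until drained

void producer() {
    std::unique_lock<std::mutex> lk(bufferLock);
    for (;;) {
        producerCv.wait(lk, [] { return buffer.empty(); });
        for (int i = 0; i < 1024; ++i) buffer.push(i);  // completely fill
        consumerCv.notify_all();
    }
}

int fetchItemFromBuffer() {
    std::unique_lock<std::mutex> lk(bufferLock);
    while (buffer.empty()) {
        producerCv.notify_one();   // ask the producer for a refill
        consumerCv.wait(lk);
    }
    int item = buffer.front();
    buffer.pop();
    return item;
}

Note that process(item) runs after fetchItemFromBuffer() returns, i.e. outside the lock, which is exactly where the parallelism comes from.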

A relatively simple idea would be to split the buffer into chunks.
For example, let's say you have a buffer of size 1024. You could split it into 64 chunks of size 16 each, or in any other way that suits your needs.
You will then need a mutex for each chunk. Each consumer then decides which element it wants to remove and then proceeds to lock the appropriate chunk. However, it may need to re-select and lock other chunks, if the initially selected chunk only has false values.
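A rough sketch of that layout (the names Chunk and tryTakeFromChunk are invented for the example):

#include <array>
#include <mutex>

constexpr int kChunks = 64;
constexpr int kChunkSize = 16;

struct Chunk {
    std::mutex lock;
    std::array<bool, kChunkSize> slots{};  // true = item present
};

std::array<Chunk, kChunks> chunks;

// A consumer locks only one chunk at a time, so consumers working on
// different chunks proceed fully in parallel.
bool tryTakeFromChunk(int c) {
    std::lock_guard<std::mutex> g(chunks[c].lock);
    for (bool& s : chunks[c].slots) {
        if (s) { s = false; return true; }
    }
    return false;  // this chunk was empty; the caller re-selects another
}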
Another approach is lock-free programming, but it depends on how far you want to go into this. A good introduction is here: http://preshing.com/20120612/an-introduction-to-lock-free-programming/

Related

How to get size of boost SPSC Queue?

We would like to know the number of elements in the queue at a given point in time.
We are pushing and popping objects, and we would like to know the number of objects in the queue's buffer.
Is there any built-in function for this, or some other way to get it?
http://www.boost.org/doc/libs/1_53_0/doc/html/boost/lockfree/spsc_queue.html
You can't reliably get the size, because that invites race conditions. For the same reason you won't find an empty() method: by the time it returned a value, that value would already be irrelevant, because it might have changed.
Sometimes lock-free containers provide an "unreliable_size()" method (for statistics/logging purposes).
The special case here is that spsc_queue assumes a single producer and a single consumer:
size_type read_available() const;
    number of available elements that can be popped from the spsc_queue
size_type write_available() const;
    get write space to write elements
Note these are only valid when used from the respective consumer/producer thread.
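For example, a minimal sketch (safe only because it runs on the consumer thread; the value is a lower bound, since the producer may push more items concurrently):

#include <boost/lockfree/spsc_queue.hpp>
#include <cstddef>
#include <iostream>

boost::lockfree::spsc_queue<int, boost::lockfree::capacity<1024>> q;

void consumerSide() {
    // From the consumer thread, read_available() never over-reports:
    // at least this many items can be popped right now.
    std::size_t atLeast = q.read_available();
    std::cout << "at least " << atLeast << " items queued\n";
}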
So our operations are limited to the pop() and push() functions, and in your software design you have to focus on these. For example, if you are the consumer, you are limited to consuming the items on the queue one at a time, and you have to rely on another channel of communication with the producer (a condition variable or an atomic variable):
std::atomic<bool> producer_done(false); // the producer sets this to tell the consumer its status
boost::lockfree::spsc_queue<Obj, boost::lockfree::capacity<1024>> theQ; // assume the producer has pushed items
Obj tmpObj;

while (!producer_done) {
    if (!theQ.pop(tmpObj)) {
        std::cerr << "did not get any item from the producer\n";
        // The producer may be too slow; your only choice is to loop and wait,
        // or use more complicated inter-thread communication.
        // Maybe sleep a little:
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    else { // you got an item to work on
        consume(tmpObj);
    }
}
// Now you know the single producer is no longer adding items to the queue,
// so drain whatever is left:
while (theQ.pop(tmpObj)) {
    consume(tmpObj);
}
This is essentially the coding pattern you can use with the spsc_queue on the consumer side.

concurrent message processing ordered chronologically

I want to optimize a message decoder written in C++ in terms of performance. The decoder is designed completely sequentially. The concept for the actual parallelization is fairly simple:
As soon as new data arrives on a certain socket, tell a thread pool to run another thread that will decode the received message.
At the end of each thread, a method is invoked (namely, a Qt signal is emitted) and an object created during processing is passed along.
My problem is: the length and complexity of the processed messages vary, so the order in which threads finish might differ from the order in which the messages were received. In other words, I need to serialize the results in place, without the use of a thread-safe container.
How can I make sure that the threads, as soon as they finish, call the method mentioned above in the correct chronological order, without queueing the results in a thread-safe container?
My first idea was to create as many mutexes as there are threads in the thread pool and then use each mutex to send a "finished" signal from an older thread to a newer thread.
Any comments appreciated!
If you really don't want to use a data structure like a priority_queue or a sequence of pre-reserved buffers, and would rather block your threads instead, you can do the following:
1. Pair each message with an index that indicates its original position, and pass it on to the thread pool.
2. Use a common (e.g. global, atomic) counter variable that indicates the last processed message.
3. Let each thread wait until this variable indicates that the previous message has been processed.
4. Pass on the produced object and increase the counter.
The code would look something like this:
#include <atomic>
#include <thread>

struct MsgIndexed {
    size_t idx;
    Msg msg;
};

// Single thread that receives all messages sequentially
void threadReceive() {
    for (size_t i = 1; true; i++)
    {
        Msg m = readMsg();
        dispatchMsg(MsgIndexed{ i, m });
    }
}

std::atomic<size_t> cnt{ 0 };

// multiple worker threads that work in parallel
void threadWork() {
    while (1) {
        MsgIndexed msg = waitforMsg();
        Obj obj = processMsg(msg.msg);
        // Just for demonstration purposes.
        // You probably don't want to use a spinlock here, but e.g. a condition variable instead
        while (cnt != (msg.idx - 1u)) { std::this_thread::yield(); }
        forwardObj(obj);
        cnt++;
    }
}
Just be aware that this is a quite inefficient solution, as your worker threads still have to wait around after they are done with their actual work.
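For reference, a sketch of the condition-variable variant alluded to in the comment above, replacing the spinlock (same idx numbering as before; Obj and forwardObj come from the snippet above):

#include <condition_variable>
#include <cstddef>
#include <mutex>

std::mutex orderMutex;
std::condition_variable orderCv;
std::size_t cnt = 0;  // last forwarded index, protected by orderMutex

void forwardInOrder(std::size_t idx, Obj obj) {
    std::unique_lock<std::mutex> lk(orderMutex);
    // Sleep until our predecessor (idx - 1) has been forwarded.
    orderCv.wait(lk, [&] { return cnt == idx - 1; });
    forwardObj(obj);
    cnt++;
    orderCv.notify_all();  // wake whichever worker holds the next index
}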

Concurrency issue with std::map insert/erase

I'm writing a threaded application that will process a list of resources and may or may not place a resulting item in a container (std::map) for each resource.
The processing of resources takes place in multiple threads.
The result container will be traversed, and each item acted upon by a separate thread, which takes an item, updates a MySQL database (using the mysqlcppconn API), then removes the item from the container and continues.
For simplicity's sake, here's an overview of the logic:
queueWorker() - thread
    getResourcesList() - seeds the global queue
databaseWorker() - thread
    commitProcessedResources() - commits results to a database every n seconds
processResources() - thread x <# of processor cores>
    processResource()
        queueResultItem()
And here is the pseudo-implementation to show what I'm doing.

/* not the actual structs, but just for simplicity's sake */
struct queue_item_t {
    int id;
    string hash;
    string text;
};

struct result_item_t {
    string hash; // hexadecimal sha1 digest
    int state;
};

std::map< string, queue_item_t > queue;
std::map< string, result_item_t > results;
bool processResource (queue_item_t *item)
{
    result_item_t result;
    if (some_stuff_that_doesnt_apply_to_all_resources)
    {
        result.hash = item->hash;
        result.state = 1;
        /* PROBLEM IS HERE */
        queueResultItem(result);
    }
}

void commitProcessedResources ()
{
    pthread_mutex_lock(&resultQueueMutex);
    // this can take a while, since the MySQL work happens per item
    for (std::map< string, result_item_t >::iterator it = results.begin(); it != results.end();)
    {
        // do mysql stuff that takes a while
        results.erase(it++);
    }
    pthread_mutex_unlock(&resultQueueMutex);
}

void queueResultItem (result_item_t result)
{
    pthread_mutex_lock(&resultQueueMutex);
    results.insert(make_pair(result.hash, result));
    pthread_mutex_unlock(&resultQueueMutex);
}
As indicated in processResource(), the problem is that while commitProcessedResources() is running and resultQueueMutex is locked, every call to queueResultItem() will block trying to lock the same mutex, and will therefore wait until the commit is done, which might take a while.
Since there is, obviously, a limited number of threads running, as soon as all of them are waiting in queueResultItem(), no more work gets done until the mutex is released and usable again.
So, my question is: how do I best go about implementing this? Is there a specific kind of standard container that can be inserted into and deleted from simultaneously, or does something exist that I just don't know of?
It is not strictly necessary that each queue item has its own unique key, as is the case here with the std::map, but I would prefer it, since several resources can produce the same result and I would prefer to only send a unique result to the database, even if it does use INSERT IGNORE to ignore any duplicates.
I'm fairly new to C++, so I've no idea what to look for on Google, unfortunately. :(
You do not have to hold the queue's lock during the whole of the processing in commitProcessedResources(). You can instead swap the queue with an empty one:
void commitProcessedResources ()
{
    std::map< string, result_item_t > queue2;
    pthread_mutex_lock(&resultQueueMutex);
    // XXX Do a quick swap.
    queue2.swap (results);
    pthread_mutex_unlock(&resultQueueMutex);
    // this can take a while, but the lock is no longer held
    for (std::map< string, result_item_t >::iterator it = queue2.begin();
         it != queue2.end(); ++it)
    {
        // do mysql stuff that takes a while
        // XXX You do not need this.
        //results.erase(it++);
    }
}
You will need to use synchronization methods (i.e. the mutex) to make this work properly. However, a goal of parallel programming is to minimize the critical section (the amount of code executed while you hold the lock).
That said, if your MySQL queries can run in parallel without synchronization (i.e. multiple calls won't conflict with each other), take them out of the critical section. This will greatly reduce overhead. For instance, a simple refactor like the following could do the trick:
void commitProcessedResources ()
{
    // MOVING THIS LOCK
    // this can take a while
    pthread_mutex_lock(&resultQueueMutex);
    std::map<string, result_item_t>::iterator end = results.end();
    std::map<string, result_item_t>::iterator begin = results.begin();
    pthread_mutex_unlock(&resultQueueMutex);
    for (std::map< string, result_item_t >::iterator it = begin; it != end;)
    {
        // do mysql stuff that takes a while
        pthread_mutex_lock(&resultQueueMutex); // Is this the only place we need it?
        // This is a MUCH smaller critical section
        results.erase(it++);
        pthread_mutex_unlock(&resultQueueMutex); // Unlock or everything will block until end of loop
    }
    // MOVED UNLOCK
}
This will give you concurrent, "real-time" access to the data across multiple threads: as each write finishes, the map is updated and can be read elsewhere with current information.
Up through C++03, the standard didn't define anything about threading or thread safety at all (and since you're using pthreads, I'm guessing that's pretty much what you're using).
As such, it's up to you to do the locking on your shared map, to ensure that only one thread tries to access it at any given time. Without that, you're likely to corrupt its internal data structure, so the map is no longer valid at all.
Alternatively (and I'd generally prefer this), you could have your multiple threads just put their data into a thread-safe queue, and have a single thread that gets data from that queue and puts it into the map. Since the map is then single-threaded, you no longer have to lock it while it's in use.
There are a few reasonable possibilities for dealing with the delay while you flush the map to the database. Probably the simplest is to have the same thread read from the queue, insert into the map, and periodically flush the map. In this case, the incoming data just sits in the queue while the map is being flushed. This keeps access to the map simple: since only one thread ever touches it directly, it can use the map without any locking.
Another would be to have two maps. At any given time, the thread that flushes gets one map, and the thread that retrieves from the queue and inserts gets the other. When the flushing thread needs to do its thing, it just swaps the roles of the two. Personally, I think I prefer the first, though; eliminating all the locking around the map has a great deal of appeal, at least to me.
Yet another variant that would maintain that simplicity would be for the queue->map thread to create a map, fill it, and when it's full enough (i.e., after the appropriate length of time) stuff it into another queue, then repeat from the start (i.e., create a new map, etc.). The flushing thread retrieves a map from its incoming queue, flushes it, and destroys it. Though this adds a bit of overhead creating and destroying maps, you're not doing it often enough to care a lot. You still keep single-threaded access to any map at any time, and still keep all the database access segregated from everything else.
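To make the queue-plus-single-owner idea concrete, here is a minimal sketch (pendingCv, mapOwnerThread and friends are invented names; the flush step is only indicated by a comment):

#include <condition_variable>
#include <map>
#include <mutex>
#include <queue>
#include <string>

struct result_item_t { std::string hash; int state; };

std::queue<result_item_t> pending;  // workers push here
std::mutex pendingMutex;
std::condition_variable pendingCv;

void queueResultItem(result_item_t r) {  // called by the worker threads
    { std::lock_guard<std::mutex> g(pendingMutex); pending.push(r); }
    pendingCv.notify_one();
}

// The single owner thread drains the queue into the map and flushes
// periodically. No other thread ever touches `results`, so the map
// itself needs no locking at all.
void mapOwnerThread() {
    std::map<std::string, result_item_t> results;
    for (;;) {
        std::queue<result_item_t> batch;
        {
            std::unique_lock<std::mutex> lk(pendingMutex);
            pendingCv.wait(lk, [] { return !pending.empty(); });
            std::swap(batch, pending);  // grab everything in one cheap swap
        }
        for (; !batch.empty(); batch.pop())
            results.insert(std::make_pair(batch.front().hash, batch.front()));
        // every n seconds: flush `results` to MySQL and clear it
    }
}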

Keep Track of References to Data (How Many / Who) in Multithreading

I came across a problem in multithreading. The model is 1 producer - N consumers.
The producer produces data (character data, around 200 bytes each) and puts it in a fixed-size cache (of, say, 2 million entries). The data is not relevant to all the threads; the producer applies a (configured) filter and determines the number of threads that qualify for the produced data.
The producer pushes a pointer to the data onto the queue of each qualifying thread (only a pointer to the data, to avoid copying it). The threads dequeue it and send it over TCP/IP to their clients.
Problem: because only a pointer to the data is handed to multiple threads, when the cache becomes full and the producer wants to delete the first (oldest) item, some thread may still be referring to the data.
A feasible way: use atomic granularity. When the producer determines the number of qualifying threads, it can update a counter and a list of thread ids.
class InUseCounter
{
    int m_count;
    set<thread_t> m_in_use_threads;
    Mutex m_mutex;
    Condition m_cond;
public:
    // This constructor is used by the producer
    InUseCounter(int count, set<thread_t> tlist)
    {
        m_count = count;
        m_in_use_threads = tlist;
    }
    // This function is called by each thread when it is done
    // with the data, informing that it no longer uses the
    // reference to the data.
    void decrement(thread_t tid)
    {
        Guard<Mutex> lock(m_mutex);
        --m_count;
        m_in_use_threads.erase(tid);
    }
    int get_count() const { return m_count; }
};
master cache: map<seqnum, Data>, where each Data entry is a pair<CharData, InUseCounter>
When the producer removes an element, it checks the counter; if it is more than 0, it sends an action to the threads in the m_in_use_threads set asking them to release their reference.
Questions:
1. If there are 2 million records in the master cache, there will be an equal number of InUseCounters, and hence of mutex variables. Is it advisable to have 2 million mutex variables in a single process?
2. Having one big data structure to maintain the InUseCounters will cause more locking time to find and decrement.
3. What would be the best alternative to my approach for tracking the references, and who holds them, with very little locking time?
Thanks in advance for your advice.
2 million mutexes is a bit much. Even if they are lightweight locks, they still take up some overhead.
Putting the InUseCounter in a single structure would end up involving contention between threads when they release a record; if the threads do not execute in lockstep, this might be negligible. If they are frequently releasing records and the contention rate goes up, this is obviously a performance sink.
You can improve performance by having one thread responsible for maintaining the record reference counts (the producer thread) and having the other threads send back record release events over a separate queue, in effect, turning the producer into a record release event consumer. When you need to flush an entry, process all the release queues first, then run your release logic. You will have some latency to deal with, as you are now queueing up release events instead of attempting to process them immediately, but the performance should be much better.
Incidentally, this is similar to how the Disruptor framework works. It's a high-performance Java(!) concurrency framework for high-frequency trading. Yes, I did say high performance, Java, and concurrency in the same sentence. It offers a lot of valuable insight into high-performance concurrency design and implementation.
Since you already have a producer->consumer queue, one very simple system consists of having a "feedback" queue (consumer->producer).
After having consumed an item, the consumer feeds the pointer back to the producer, so that the producer can remove the item and update the "free list" of the cache.
This way, only the producer ever touches the cache internals, and no synchronization is necessary there: only the queues need to be synchronized.
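A sketch of that feedback queue, with invented names (Data, releaseQueue, drainReleases); the cache free-list update is only indicated by a comment:

#include <mutex>
#include <queue>

struct Data { char payload[200]; };

std::queue<const Data*> releaseQueue;  // consumer -> producer feedback
std::mutex releaseMutex;

// Consumer: after sending the item over TCP/IP, hand the pointer back.
void releaseItem(const Data* d) {
    std::lock_guard<std::mutex> g(releaseMutex);
    releaseQueue.push(d);
}

// Producer: before evicting, drain the feedback queue and return the
// released entries to the cache's free-list. Only the producer ever
// touches the cache internals, so the cache itself needs no lock.
void drainReleases() {
    std::lock_guard<std::mutex> g(releaseMutex);
    while (!releaseQueue.empty()) {
        // cache.markFree(releaseQueue.front());
        releaseQueue.pop();
    }
}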
Yes, 2,000,000 mutexes are overkill.
One big structure will be locked for longer, but will require far fewer lock/unlock operations.
The best approach would be to use shared_ptr smart pointers: they seem tailor-made for this. You don't check the counter yourself, you just clean up your pointer. shared_ptr is thread-safe (the reference counting, not the data it points to), but for 1 producer (writer) / N consumers (readers) this should not be an issue.
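As a sketch of the shared_ptr approach (masterCache, evictOldest and the key type are illustrative), the reference count replaces InUseCounter entirely:

#include <map>
#include <memory>

struct Data { char payload[200]; };

// The producer holds one shared_ptr per entry; each qualifying thread's
// queue holds another shared_ptr to the same Data.
std::map<long, std::shared_ptr<Data>> masterCache;  // producer-only

void evictOldest() {
    // Erasing only drops the cache's reference. The Data is freed
    // automatically when the last consumer destroys its copy of the
    // shared_ptr, so no explicit reference tracking is needed.
    if (!masterCache.empty())
        masterCache.erase(masterCache.begin());
}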

What is better for a message queue? mutex & cond or mutex & semaphore?

I am implementing a C++ message queue based on a std::queue.
As I need poppers to wait on an empty queue, I was considering using a mutex for mutual exclusion and a condition variable for suspending threads on an empty queue, as glib does with GAsyncQueue.
However, it looks to me like a mutex & semaphore would do the job. The semaphore contains an integer, and that seems like a pretty high number to reach with pending messages.
A pro of the semaphore is that you don't need to manually re-check the condition each time you return from a wait, as you know for sure that someone inserted something (e.g. when someone inserted two items and you are the second thread arriving).
Which one would you choose?
EDIT:
Changed the question in response to #Greg Rogers
A single semaphore does not do the job - you need to be comparing (mutex + semaphore) and (mutex + condition variable).
It is pretty easy to see this by trying to implement it:
void push(T t)
{
    queue.push(t);
    sem.post();
}

T pop()
{
    sem.wait();
    T t = queue.front();
    queue.pop();
    return t;
}
As you can see there is no mutual exclusion when you are actually reading/writing the queue, even though the signalling (from the semaphore) is there. Multiple threads can call push at the same time and break the queue, or multiple threads could call pop at the same time and break it. Or a thread could call pop and be removing the first element of the queue while another thread calls push.
You should use whichever you think is easier to implement correctly; I doubt performance will vary much, if at all (it might be interesting to measure, though).
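For comparison, a minimal sketch of the (mutex + condition variable) version, which is what the manual re-check in the question refers to:

#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class MessageQueue {
    std::queue<T> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void push(T t) {
        { std::lock_guard<std::mutex> g(m); q.push(std::move(t)); }
        cv.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !q.empty(); });  // re-check on each wakeup
        T t = std::move(q.front());
        q.pop();
        return t;
    }
};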
Personally I use a mutex to serialize access to the list, and wake up the consumer by sending a byte over a socket (produced by socketpair()). That may be somewhat less efficient than a semaphore or condition variable, but it has the advantage of allowing the consumer to block in select()/poll(). That way the consumer can also wait on other things besides the data queue, if it wants to. It also lets you use the exact same queueing code on almost all OS's, since practically every OS supports the BSD sockets API.
Pseudocode follows:
// Called by the producer. Adds a data item to the queue, and sends a byte
// on the socket to notify the consumer, if necessary
void PushToQueue(const DataItem & di)
{
    mutex.Lock();
    bool sendSignal = (queue.size() == 0);
    queue.push_back(di);
    mutex.Unlock();
    if (sendSignal) producerSocket.SendAByteNonBlocking();
}

// Called by the consumer after consumerSocket selects as ready-for-read.
// Returns true if (di) was written to, or false if there wasn't anything to read after all.
// The consumer should call this in a loop until it returns false, and then
// go back to sleep inside select() to wait for further data from the producer.
bool PopFromQueue(DataItem & di)
{
    consumerSocket.ReadAsManyBytesAsPossibleWithoutBlockingAndThrowThemAway();
    mutex.Lock();
    bool ret = (queue.size() > 0);
    if (ret) queue.pop_front(di);
    mutex.Unlock();
    return ret;
}
If you want to allow multiple users simultaneous access to your queue, you should use semaphores:

sema(10) // ten threads/processes have concurrent access
sema_lock(&sema_obj)
    queue
sema_unlock(&sema_obj)

A mutex will "authorize" only one user at a time:

pthread_mutex_lock(&mutex_obj)
    global_data;
pthread_mutex_unlock(&mutex_obj)

That's the main difference, and you should decide which solution fits your requirements. But I'd choose the mutex approach, because you don't need to specify how many users can grab your resource.
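For completeness, here is a sketch of the (mutex + semaphore) combination discussed in this question, using C++20's std::counting_semaphore (earlier standards would need POSIX sem_t or similar):

#include <mutex>
#include <queue>
#include <semaphore>

template <typename T>
class SemMessageQueue {
    std::queue<T> q;
    std::mutex m;                        // protects the queue itself
    std::counting_semaphore<> items{0};  // counts queued messages
public:
    void push(T t) {
        { std::lock_guard<std::mutex> g(m); q.push(std::move(t)); }
        items.release();   // one count per message, no manual re-check needed
    }
    T pop() {
        items.acquire();   // blocks until at least one item exists
        std::lock_guard<std::mutex> g(m);
        T t = std::move(q.front());
        q.pop();
        return t;
    }
};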