boost::lockfree::spsc_queue busy wait strategy. Is there a blocking pop? - c++

So I'm using a boost::lockfree::spsc_queue to communicate between two boost threads running functors of two objects in my application.
All is fine except for the fact that the spsc_queue::pop() method is non-blocking. It returns true or false even if there is nothing in the queue. However my queue always seems to return true (problem #1). I think this is because I preallocate the queue.
typedef boost::lockfree::spsc_queue<q_pl, boost::lockfree::capacity<100000> > spsc_queue;
This means that to use the queue efficiently I have to busy wait, constantly popping the queue and using 100% CPU. I'd rather not sleep for arbitrary amounts of time. I've used other queues in Java which block until an object is made available. Can this be done with std:: or boost:: data structures?

A lock free queue, by definition, does not have blocking operations.
How would you synchronize on the data structure? There is no internal lock, for obvious reasons, because that would mean all clients need to synchronize on it, making it your grandfather's locking concurrent queue.
So indeed, you will have to devise a waiting function yourself. How you do this depends on your use case, which is probably why the library doesn't supply one (disclaimer: I haven't checked and I don't claim to know the full documentation).
So what can you do:
As you already described, you can spin in a tight loop. Obviously, you'll do this if you know that your wait condition (queue non-empty) is always going to be satisfied very quickly.
Alternatively, poll the queue at a certain frequency (doing micro-sleeps in the meantime). Scheduling a good frequency is an art: for some applications 100ms is optimal, for others, a potential 100ms wait would destroy throughput. So, vary and measure your performance indicators (don't forget about power consumption if your application is going to be deployed on many cores in a datacenter :)).
Lastly, you could arrive at a hybrid solution, spinning for a fixed number of iterations, and resorting to (increasing) interval polling if nothing arrives. This would nicely support server applications where high loads occur in bursts.
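Here is a minimal sketch of such a hybrid wait, assuming an spsc_queue of some element type; the spin count and back-off intervals are purely illustrative and should be tuned against your own measurements:
#include <boost/lockfree/spsc_queue.hpp>
#include <chrono>
#include <thread>

// Hybrid wait: spin for a bounded number of iterations, then fall back
// to sleeping in increasing intervals so we stop burning CPU.
// Usage (assuming a global spsc_queue named q): int v; hybrid_pop(q, v);
template <typename Queue, typename T>
void hybrid_pop(Queue& q, T& value)
{
    // Phase 1: spin, assuming data usually arrives very quickly.
    for (int i = 0; i < 10000; ++i)
        if (q.pop(value))
            return;

    // Phase 2: exponentially growing micro-sleeps, capped at 1ms.
    auto delay = std::chrono::microseconds(10);
    while (!q.pop(value))
    {
        std::this_thread::sleep_for(delay);
        if (delay < std::chrono::milliseconds(1))
            delay *= 2;
    }
}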

Use a semaphore to cause the producers to sleep when the queue is full, and another semaphore to cause the consumers to sleep when the queue is empty.
When the queue is neither full nor empty, the sem_post and sem_wait operations are non-blocking (in newer kernels).
#include <semaphore.h>
#include <cassert>

template<typename lock_free_container>
class blocking_lock_free
{
public:
    blocking_lock_free(size_t n) : container(n)
    {
        sem_init(&pop_semaphore, 0, 0);
        sem_init(&push_semaphore, 0, n);
    }
    ~blocking_lock_free()
    {
        sem_destroy(&pop_semaphore);
        sem_destroy(&push_semaphore);
    }
    bool push(const typename lock_free_container::value_type& v)
    {
        sem_wait(&push_semaphore);
        bool ret = container.bounded_push(v);
        assert(ret);
        if (ret)
            sem_post(&pop_semaphore);
        else
            sem_post(&push_semaphore); // shouldn't happen
        return ret;
    }
    bool pop(typename lock_free_container::value_type& v)
    {
        sem_wait(&pop_semaphore);
        bool ret = container.pop(v);
        assert(ret);
        if (ret)
            sem_post(&push_semaphore);
        else
            sem_post(&pop_semaphore); // shouldn't happen
        return ret;
    }
private:
    lock_free_container container;
    sem_t pop_semaphore;
    sem_t push_semaphore;
};
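A hypothetical instantiation might look like this, assuming the element type is int and a runtime-sized boost::lockfree::queue (which provides the bounded_push used above):
#include <boost/lockfree/queue.hpp>

// Blocking wrapper around a lock-free MPMC queue with room for 100000 ints.
blocking_lock_free<boost::lockfree::queue<int>> q(100000);

// Producer threads block in sem_wait when the queue is full:
//   q.push(42);
// Consumer threads block in sem_wait when the queue is empty:
//   int v; q.pop(v);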

Related

Using Boost.Lockfree queue is slower than using mutexes

Until now I was using std::queue in my project. I measured the average time which a specific operation on this queue requires.
The times were measured on 2 machines: My local Ubuntu VM and a remote server.
Using std::queue, the average was almost the same on both machines: ~750 microseconds.
Then I "upgraded" the std::queue to boost::lockfree::spsc_queue, so I could get rid of the mutexes protecting the queue. On my local VM I could see a huge performance gain; the average is now 200 microseconds. On the remote machine, however, the average went up to 800 microseconds, which is slower than it was before.
First I thought this might be because the remote machine might not support the lock-free implementation:
From the Boost.Lockfree page:
Not all hardware supports the same set of atomic instructions. If it is not available in hardware, it can be emulated in software using guards. However this has the obvious drawback of losing the lock-free property.
To find out if these instructions are supported, boost::lockfree::queue has a method called bool is_lock_free(void) const;.
However, boost::lockfree::spsc_queue does not have a function like this, which, for me, implies that it does not rely on the hardware and that it is always lock-free - on any machine.
What could be the reason for the performance loss?
Example code (Producer/Consumer)
// c++11 compiler and boost library required
#include <iostream>
#include <cstdlib>
#include <chrono>
#include <future>
#include <thread>
/* Using blocking queue:
 * #include <mutex>
 * #include <queue>
 */
#include <boost/lockfree/spsc_queue.hpp>

boost::lockfree::spsc_queue<int, boost::lockfree::capacity<1024>> queue;
/* Using blocking queue:
 * std::queue<int> queue;
 * std::mutex mutex;
 */

int main()
{
    auto producer = std::async(std::launch::async, []()
    {
        // Producing data in a random interval
        while (true)
        {
            /* Using the blocking queue, the mutex must be locked here.
             * mutex.lock();
             */
            // Push random int (0-9999)
            queue.push(std::rand() % 10000);
            /* Using the blocking queue, the mutex must be unlocked here.
             * mutex.unlock();
             */
            // Sleep for random duration (0-999 microseconds)
            std::this_thread::sleep_for(std::chrono::microseconds(std::rand() % 1000));
        }
    });
    auto consumer = std::async(std::launch::async, []()
    {
        // Example operation on the queue.
        // Checks if 1234 was generated by the producer, returns if found.
        while (true)
        {
            /* Using the blocking queue, the mutex must be locked here.
             * mutex.lock();
             */
            int value;
            while (queue.pop(value))
            {
                if (value == 1234)
                    return;
            }
            /* Using the blocking queue, the mutex must be unlocked here.
             * mutex.unlock();
             */
            // Sleep for 100 microseconds
            std::this_thread::sleep_for(std::chrono::microseconds(100));
        }
    });
    consumer.get();
    std::cout << "1234 was generated!" << std::endl;
    return 0;
}
Lock free algorithms generally perform more poorly than lock-based algorithms. That's a key reason they're not used nearly as frequently.
The problem with lock free algorithms is that they maximize contention by allowing contending threads to continue to contend. Locks avoid contention by de-scheduling contending threads. Lock free algorithms, to a first approximation, should only be used when it's not possible to de-schedule contending threads. That only rarely applies to application-level code.
Let me give you a very extreme hypothetical. Imagine four threads are running on a typical, modern dual-core CPU. Threads A1 and A2 are manipulating collection A. Threads B1 and B2 are manipulating collection B.
First, let's imagine the collection uses locks. That will mean that if threads A1 and A2 (or B1 and B2) try to run at the same time, one of them will get blocked by the lock. So, very quickly, one A thread and one B thread will be running. These threads will run very quickly and will not contend. Any time threads try to contend, the conflicting thread will get de-scheduled. Yay.
Now, imagine the collection uses no locks. Now, threads A1 and A2 can run at the same time. This will cause constant contention. Cache lines for the collection will ping-pong between the two cores. Inter-core buses may get saturated. Performance will be awful.
Again, this is highly exaggerated. But you get the idea. You want to avoid contention, not suffer through as much of it as possible.
However, now run this thought experiment again where A1 and A2 are the only threads on the entire system. Now, the lock free collection is probably better (though you may find that it's better just to have one thread in that case!).
Almost every programmer goes through a phase where they think that locks are bad and avoiding locks makes code go faster. Eventually, they realize that it's contention that makes things slow and locks, used correctly, minimize contention.
I cannot say that the boost lockfree queue is slower in all possible cases. In my experience, push(const T& item) makes a copy. If you are constructing temporary objects and pushing them onto the queue, you take a performance hit. I think the library just needs an overloaded push(T&& item) to make movable objects more efficient. Before the addition of such a function, you might have to push pointers instead - either plain ones or the smart ones offered since C++11. This is a rather limited aspect of the queue, and I only use the lock-free queue very rarely.
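A minimal sketch of the pointer work-around, with a hypothetical Payload type: the queue stores raw pointers, so push copies only the pointer, and ownership is handed over explicitly.
#include <boost/lockfree/spsc_queue.hpp>
#include <memory>
#include <string>
#include <utility>

// Hypothetical payload; copying it into the queue would be expensive.
struct Payload { std::string body; };

// Queue of raw pointers: pushing copies only the pointer, not the payload.
boost::lockfree::spsc_queue<Payload*, boost::lockfree::capacity<1024>> ptr_queue;

void produce(std::string text)
{
    std::unique_ptr<Payload> p(new Payload{std::move(text)});
    if (ptr_queue.push(p.get()))
        p.release();                 // the queue now owns the object
    // else: the queue was full and the unique_ptr cleans up
}

void consume()
{
    Payload* raw = nullptr;
    while (ptr_queue.pop(raw))
    {
        std::unique_ptr<Payload> p(raw);  // take ownership back
        // ... process p->body ...
    }
}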

Recommended pattern for a queue accessed by multiple threads...what should the worker thread do?

I have a queue of objects that is being added to by a thread A. Thread B is removing objects from the queue and processing them. There may be many threads A and many threads B.
I am using a mutex when the queue is being "push"ed to, and also when "front"ed and "pop"ped from, as shown in the pseudo-code below:
Thread A calls this to add to the queue:
void Add(object)
{
    mutex->lock();
    queue.push(object);
    mutex->unlock();
}
Thread B processes the queue as follows:
object GetNextTargetToWorkOn()
{
    object = NULL;
    mutex->lock();
    if (! queue.empty())
    {
        object = queue.front();
        queue.pop();
    }
    mutex->unlock();
    return(object);
}
void DoTheWork(int param)
{
    while(true)
    {
        object structure;
        while( (object = GetNextTargetToWorkOn()) == NULL)
            boost::thread::sleep(100ms); // sleep a very short time
        // do something with the object
    }
}
What bothers me is the while / get-object / sleep-if-no-object paradigm. While there are objects to process it is fine. But while the thread is waiting for work there are two problems:
a) The while loop is spinning, consuming resources
b) the sleep means wasted time if a new object comes in to be processed
Is there a better pattern to achieve the same thing?
You're using spin-waiting; a better design is to use a monitor. Read more about the details on Wikipedia.
And a cross-platform solution using std::condition_variable with a good example can be found here.
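A minimal sketch of that approach, reusing the Add/GetNextTargetToWorkOn names from the question, with std::mutex and std::condition_variable instead of the sleep loop (this assumes the workers may block indefinitely; a shutdown mechanism is omitted):
#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class BlockingQueue
{
public:
    // Thread A: push an object and wake one waiting worker.
    void Add(T object)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(object));
        }
        cond_.notify_one();
    }

    // Thread B: blocks until an object is available; no spinning, no sleeps.
    T GetNextTargetToWorkOn()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return !queue_.empty(); });
        T object = std::move(queue_.front());
        queue_.pop();
        return object;
    }

private:
    std::queue<T> queue_;
    std::mutex mutex_;
    std::condition_variable cond_;
};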
a) The while loop is spinning, consuming resources
b) the sleep means wasted time if a new object comes in to be processed
It has been my experience that the sleep you used actually 'fixes' both of these issues.
a) The consuming of resources is a small amount of RAM and a remarkably small fraction of the available CPU cycles.
b) Sleep is not wasted time on the OS's I've worked on.
c) Sleep can affect 'reaction time' (aka latency), but has seldom been an issue (outside of interrupts.)
The time spent in sleep is likely to be several orders of magnitude longer than the time spent in this simple loop. i.e. It is not significant.
IMHO - this is an ok implementation of the 'good neighbor' policy of relinquishing the processor as soon as possible.
On my desktop, AMD64 Dual Core, Ubuntu 15.04, a semaphore enforced context switch takes ~13 us.
100 ms ==> 100,000 us .. that is 4 orders of magnitude difference, i.e. VERY insignificant.
In the 5 OS's (Linux, vxWorks, OSE, and several other embedded-system OS's) I have worked on, sleep (or its equivalent) is the correct way to relinquish the processor, so that the processor is not blocked from running another thread while this one is sleeping.
Note: It is feasible that some OS's sleep might not relinquish the processor. So, you should always confirm. I've not found one. Oh, but I admit I have not looked / worked much on Windows.

Keep Track of Reference to Data ( How Many / Who ) in Multithreading

I came across a problem in multithreading. The model is 1 producer - N consumers.
The producer produces data (character data, around 200 bytes each) and puts it in a fixed-size cache (i.e. 2 million entries). The data is not relevant to all the threads. It applies a (configured) filter and determines the number of threads that qualify for the produced data.
The producer pushes a pointer to the data into the queue of each qualifying thread (only a pointer to the data, to avoid a data copy). The threads dequeue it and send it over TCP/IP to their clients.
Problem: Because only a pointer to the data is given to multiple threads, when the cache becomes full the producer wants to delete the first (oldest) item, but some thread may still be referring to the data.
Feasible way: use atomic granularity. When the producer determines the number of qualifying threads, it can update the counter and the list of thread ids.
class InUseCounter
{
    int m_count;
    set<thread_t> m_in_use_threads;
    Mutex m_mutex;
    Condition m_cond;
public:
    // This constructor is used by the producer
    InUseCounter(int count, set<thread_t> tlist)
    {
        m_count = count;
        m_in_use_threads = tlist;
    }
    // This function is called by each thread when it is
    // done with the data, informing that it no longer
    // uses the reference to the data.
    void decrement(thread_t tid)
    {
        Guard<Mutex> lock(m_mutex);
        --m_count;
        m_in_use_threads.erase(tid);
    }
    int get_count() const { return m_count; }
};
master cache
map<seqnum, Data>
|
v
pair<CharData, InUseCounter>
When the producer removes an element it checks the counter; if it is more than 0, it sends an action to the threads in the m_in_use_threads set to release their reference.
Question
If there are 2 million records in the master cache, there will be an equal number of InUseCounters and hence Mutex variables. Is it advisable to have 2 million mutex variables in a single process?
Having one big data structure to maintain the InUseCounters will cause more locking time to find and decrement.
What would be the best alternative to my approach for finding out the references, and who holds them, with as little locking time as possible?
Thanks in advance for your advice.
2 million mutexes is a bit much. Even if they are lightweight locks,
they still take up some overhead.
Putting the InUseCounter in a single structure would end up involving contention between threads when they release a record; if the threads do not execute in lockstep, this might be negligible. If they are frequently releasing records and the contention rate goes up, this is obviously a performance sink.
You can improve performance by having one thread responsible for maintaining the record reference counts (the producer thread) and having the other threads send back record release events over a separate queue, in effect, turning the producer into a record release event consumer. When you need to flush an entry, process all the release queues first, then run your release logic. You will have some latency to deal with, as you are now queueing up release events instead of attempting to process them immediately, but the performance should be much better.
Incidentally, this is similar to how the Disruptor framework works. It's a high performance Java(!) concurrency framework for high frequency trading. Yes, I did say high performance Java and concurrency in the same sentence. There is a lot of valuable insight into high performance concurrency design and implementation.
Since you already have a Producer->Consumer queue, one very simple system consists in having a "feedback" queue (Consumer->Producer).
After having consumed an item, the consumer feeds the pointer back to the Producer so that the Producer can remove the item and updates the "free-list" of the cache.
This way, only the Producer ever touches the cache innards, and no synchronization is necessary there: only the queues need be synchronized.
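A minimal sketch of that feedback-queue idea, with hypothetical Record/SeqNum types: each consumer pushes the sequence number it has finished with onto its own SPSC feedback queue, and the producer drains those queues before evicting. Since only the producer touches the cache and the reference counts, they need no locking.
#include <boost/lockfree/spsc_queue.hpp>
#include <map>

// Hypothetical record type and sequence-number key, as in the question.
struct Record { char payload[200]; };
typedef unsigned long SeqNum;

// Consumer -> producer feedback queue: "I am done with sequence number N".
typedef boost::lockfree::spsc_queue<SeqNum, boost::lockfree::capacity<4096>> FeedbackQueue;

// Producer-owned state: only the producer thread ever touches these,
// so the per-record reference counts need no mutex. ref_counts[seq] is
// set to the number of qualifying threads when the record is dispatched.
std::map<SeqNum, Record> cache;
std::map<SeqNum, int>    ref_counts;

// Called by the producer before evicting: drain one consumer's feedback
// queue and release the records whose last reference just came back.
void drain_feedback(FeedbackQueue& fq)
{
    SeqNum seq;
    while (fq.pop(seq))
    {
        if (--ref_counts[seq] == 0)
        {
            ref_counts.erase(seq);
            cache.erase(seq);        // safe: no consumer refers to it anymore
        }
    }
}
On the consumer side, the only extra work after sending a record is a single fq.push(seq).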
Yes, 2,000,000 mutexes is overkill.
One big structure will be locked for longer, but will require far fewer lock/unlock operations.
The best approach would be to use shared_ptr smart pointers: they seem tailor-made for this. You don't check the counter yourself; you just let go of your pointer. shared_ptr's reference counting is thread-safe (the data it points to is not), but for 1 producer (writer) / N consumers (readers) this should not be an issue.
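A minimal sketch of the shared_ptr variant, again with hypothetical Record/SeqNum types: the cache and every qualifying consumer each hold their own shared_ptr, so the record is destroyed automatically when the last holder drops it.
#include <map>
#include <memory>

// Hypothetical record type and key, following the question's layout.
struct Record { char payload[200]; };
typedef unsigned long SeqNum;

// The cache holds shared_ptrs; each qualifying consumer receives a copy.
std::map<SeqNum, std::shared_ptr<const Record>> cache;

// Producer: evicting only drops the cache's reference. The Record itself
// is freed when the last consumer releases its copy.
void evict(SeqNum oldest)
{
    cache.erase(oldest);
}

// Consumer: sends the data, then simply lets its shared_ptr go out of scope.
void handle(std::shared_ptr<const Record> r)
{
    // ... send r->payload over TCP/IP ...
}   // reference count decremented here; deletes the Record if this was the last holder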

What is better for a message queue? mutex & cond or mutex&semaphore?

I am implementing a C++ message queue based on a std::queue.
As I need poppers to wait on an empty queue, I was considering using a mutex for mutual exclusion and a condition variable for suspending threads on an empty queue, as glib does with GAsyncQueue.
However, it looks to me like a mutex & semaphore would do the job. The semaphore just contains an integer, and its maximum value seems like a pretty high number for pending messages to reach.
A pro of the semaphore is that you don't need to manually re-check the condition each time you return from a wait, since you know for sure that someone inserted something (e.g. when someone inserted 2 items and you are the second thread arriving).
Which one would you choose?
EDIT:
Changed the question in response to @Greg Rogers.
A single semaphore does not do the job - you need to be comparing (mutex + semaphore) and (mutex + condition variable).
It is pretty easy to see this by trying to implement it:
void push(T t)
{
    queue.push(t);        // note: no mutual exclusion around the queue itself
    sem.post();
}
T pop()
{
    sem.wait();
    T t = queue.front();  // std::queue uses front(), not top()
    queue.pop();
    return t;
}
As you can see there is no mutual exclusion when you are actually reading/writing to the queue, even though the signalling (from the semaphore) is there. Multiple threads can call push at the same time and break the queue, or multiple threads could call pop at the same time and break it. Or, a thread could call pop and be removing the first element of the queue while another thread called push.
You should use whichever you think is easier to implement, I doubt performance will vary much if any (it might be interesting to measure though).
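For completeness, a minimal sketch of the (mutex + semaphore) variant using POSIX sem_t: the semaphore counts available items while the mutex serializes access to the std::queue itself.
#include <semaphore.h>
#include <mutex>
#include <queue>

template <typename T>
class SemQueue
{
public:
    SemQueue()  { sem_init(&items_, 0, 0); }
    ~SemQueue() { sem_destroy(&items_); }

    void push(T t)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);  // serialize writers
            queue_.push(std::move(t));
        }
        sem_post(&items_);                             // signal one available item
    }

    T pop()
    {
        sem_wait(&items_);                             // wait until an item exists
        std::lock_guard<std::mutex> lock(mutex_);      // serialize readers
        T t = std::move(queue_.front());
        queue_.pop();
        return t;
    }

private:
    std::queue<T> queue_;
    std::mutex    mutex_;
    sem_t         items_;
};
The condition-variable version looks almost identical, except that pop waits on the condition variable while holding the mutex and re-checks queue_.empty() after waking.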
Personally I use a mutex to serialize access to the list, and wake up the consumer by sending a byte over a socket (produced by socketpair()). That may be somewhat less efficient than a semaphore or condition variable, but it has the advantage of allowing the consumer to block in select()/poll(). That way the consumer can also wait on other things besides the data queue, if it wants to. It also lets you use the exact same queueing code on almost all OS's, since practically every OS supports the BSD sockets API.
Pseudocode follows:
// Called by the producer. Adds a data item to the queue, and sends a byte
// on the socket to notify the consumer, if necessary
void PushToQueue(const DataItem & di)
{
    mutex.Lock();
    bool sendSignal = (queue.size() == 0);
    queue.push_back(di);
    mutex.Unlock();
    if (sendSignal) producerSocket.SendAByteNonBlocking();
}
// Called by consumer after consumerSocket selects as ready-for-read
// Returns true if (di) was written to, or false if there wasn't anything to read after all
// Consumer should call this in a loop until it returns false, and then
// go back to sleep inside select() to wait for further data from the producer
bool PopFromQueue(DataItem & di)
{
    consumerSocket.ReadAsManyBytesAsPossibleWithoutBlockingAndThrowThemAway();
    mutex.Lock();
    bool ret = (queue.size() > 0);
    if (ret) queue.pop_front(di);
    mutex.Unlock();
    return ret;
}
If you want to allow multiple simultaneous users at a time to use your queue, you should use semaphores.
sema(10) // ten threads/processes have concurrent access.
sema_lock(&sema_obj)
queue
sema_unlock(&sema_obj)
A mutex will "authorize" only one user at a time.
pthread_mutex_lock(&mutex_obj)
global_data;
pthread_mutex_unlock(&mutex_obj)
That's the main difference, and you should decide which solution fits your requirements.
But I'd choose the mutex approach, because you don't need to specify how many users can grab your resource.