Performance of pthread_cond_wait and pthread_cond_signal - C++

I have two threads.
One reads from a queue. I don't want it to spin in a while(1) just to read, so I'm thinking of having it wait on a condition variable each iteration:
while(1){
    while queue is empty
        wait(cond)
    pop()
}
instead of:
while(1){
    while queue is not empty
        pop()
}
and a thread that pushes to the queue. If I use the wait/signal method, then the pushing thread needs to notify the popping thread by signaling every time(!) it pushes.
The question is: which is better to use?
If the queue is mostly non-empty, then it seems worthless (or is it not?) to send signals, because the popping thread is never waiting, and I'm afraid it would reduce performance.
However, if the queue is empty half the time, then looping to pop as in the second method is a busy wait.
I'm hoping someone here can put my fears to rest by confirming that sending a signal to a thread that's not waiting on it is still OK.
Thanks

First, just to make sure: pthread_cond_signal does not send a signal in the signal(2) sense. It just flags the condition variable and releases any thread that's waiting on it. So if you call pthread_cond_signal before the consuming thread calls pthread_cond_wait, well, that signal is just going to be ignored.
Secondly, is pthread_cond_wait fast or slow? Well, it depends. You can use it poorly and you can use it well. If you use it poorly I'm sure it will perform terribly. If you only wait when you actually need to I think it will perform nicely.
So, since you need to hold a mutex to use condition variables anyway, you might as well check whether there is data while you hold it (and use that mutex as your synchronization point).
An idea for a queue data structure:
struct q {
    struct qe *h;        /* head */
    struct qe *t;        /* tail */
    pthread_mutex_t m;
    pthread_cond_t c;
    int len;
};
The consumer (assuming you only have one consumer; if you have multiple, you need to hold the lock around the head check):
void *consumer(void *arg) {
    struct q *q = arg;
    while(1) {
        pthread_mutex_lock(&q->m);
        while(q->h == NULL)   /* loop, because wakeups can be spurious */
            pthread_cond_wait(&q->c, &q->m);
        /* We hold the mutex when we exit pthread_cond_wait */
        pthread_mutex_unlock(&q->m);
        while(q->h != NULL) { /* with a single consumer we can make this check without the mutex */
            pthread_mutex_lock(&q->m); /* but we need it to modify the queue */
            pop(q);
            pthread_mutex_unlock(&q->m);
            /* Process data */
        }
    }
}
The producer:
void *producer(void *arg) {
    struct q *q = arg;
    while(1) {
        pthread_mutex_lock(&q->m);
        push(q, some_data);
        if(q->h == q->t) /* only one element: the queue was empty, so someone may be waiting */
            pthread_cond_signal(&q->c);
        pthread_mutex_unlock(&q->m);
    }
    return NULL;
}
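For comparison, here is a minimal sketch of the same idea in C++11 (my own illustration, not part of the answer above; the names queue_, mtx_ and cv_ are made up). The predicate overload of wait folds the empty-check loop into one call, and notifying with no waiters is harmless and cheap:

#include <condition_variable>
#include <mutex>
#include <queue>

std::queue<int> queue_;
std::mutex mtx_;
std::condition_variable cv_;

void consume() {
    while (true) {
        std::unique_lock<std::mutex> lock(mtx_);
        // Sleeps only while the queue is empty; handles spurious wakeups.
        cv_.wait(lock, [] { return !queue_.empty(); });
        int item = queue_.front();
        queue_.pop();
        lock.unlock();
        // ... process item without holding the lock ...
    }
}

void produce(int item) {
    {
        std::lock_guard<std::mutex> lock(mtx_);
        queue_.push(item);
    }
    cv_.notify_one(); // essentially a no-op if nobody is waiting
}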

Related

How to build concurrent access to queue where consumer is slower than publisher?

I have a few noob c++ multithreading questions. Would appreciate any input.
I am trying to solve Project Euler P3 using multithreading in a way that teaches me about multithreading and C++ (so it probably isn't the optimal algorithm). The algorithm is to find the factors, push them into a max priority queue, and pop items from the PQ to check if they are prime.
I have one function that checks if a number is prime (takes a long time), one that finds factors and pushes them to the PQ (fast), and one that pops items from the PQ and calls isPrime(). My idea is to have two threads; one responsible for pushing data to a shared PQ and one responsible for popping data and checking for primality.
If I understand correctly:
Undefined behavior occurs when the popAndProcess() function naively checks whether the PQ is empty as its break condition. This is because the push thread might be paused by the OS before pushing all of its data, and the pop thread may then empty the PQ, at which point the function would end.
The first thing I did was have the push() function set a shared flag when it was done pushing data. The popAndProcess() thread would then use a while loop which terminates when the flag is set AND the PQ is empty. This seems bad because it causes 'busy waiting': when the PQ is empty and the flag isn't set, popAndProcess() does a bunch of useless while loop rounds.
So then I tried using a condition_variable, where the push() function/thread sends a signal every time it pushes data. The popAndProcess() function waits for this signal any time the flag isn't set AND the PQ is empty. First off, I am wondering what happens when push() sends a signal while popAndProcess() isn't waiting for it. This should happen often, since pushes are faster than popAndProcess(). Is this the right design for the problem?
Various secondary questions
Secondly, I have run into a deadlock and don't know how to get around it. I think it has to do with if (pq.empty()) cond.wait(locker, [&](){return !pq.empty();}); and the fact that I have to construct the unique_lock BEFORE that line in order to pass it to the wait function.
Why does the unique_lock constructor lock the mutex passed to it on its own? Doesn't the wait(lock) immediately unlock it, resulting in wasted time locking and unlocking?
Generally, I don't understand why the wait() function requires a mutex type object (I understand that it can be useful to avoid having a thread sleep while holding a lock). Why can't wait() just put a thread to sleep until a signal is received?
class P3 {
    // Important shared variables:
    bool fin;
    long long N, ans; // used below; presumably declared in the omitted part of the class
    priority_queue<long long> pq;
    mutex mu;
    condition_variable cond;

    static bool isPrime(long long x) {
        //...
    }

    void push() {
        // push() pushes data to the PQ a finite and known number of times.
        long long sqt = (long long)sqrt(N);
        for (int i = 1; i < sqt; i++) {
            if (!(N % i)) {
                unique_lock<mutex> locker(mu);
                long long temp = N / i;
                pq.push(temp);
                pq.push(i);
                locker.unlock();
                cond.notify_one();
            }
        }
        fin = true;
    }

    void popAndProcess() {
        // while push() hasn't finished, or push() has finished and the PQ has data...
        while (!fin || (fin && !pq.empty())) {
            // Q: Why do I have to pass a lock to the wait() function?
            unique_lock<mutex> locker(mu);
            // If the pq is empty (and, since we have reached here, push() isn't done),
            // wait() for the push() thread to push new data.
            // Otherwise, there is no need to wait for a signal, as there is data to process.
            if (pq.empty()) cond.wait(locker, [&](){ return !pq.empty(); });
            long long cur = pq.top();
            pq.pop();
            locker.unlock();
            if (isPrime(cur)) {
                ans = max(ans, cur);
            }
        }
    }

    void driver() {
        fin = false;
        thread t1(&P3::push, this);
        popAndProcess();
        t1.join();
    }
    // rest of class...
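For what it's worth, the deadlock described above is visible in this code: if push() finishes while popAndProcess() is blocked on an empty queue, the predicate !pq.empty() can never become true and nobody ever signals again. And wait() takes the lock so that checking the predicate and going to sleep happen atomically with respect to the producer; wait() releases the mutex while sleeping and re-acquires it before returning, which is why the unique_lock must be constructed (locked) first. A minimal sketch of one way to repair the exit logic, keeping the member names above (these changes are my illustration, not from the original post):

void push() {
    long long sqt = (long long)sqrt(N);
    for (int i = 1; i < sqt; i++) {
        if (!(N % i)) {
            unique_lock<mutex> locker(mu);
            pq.push(N / i);
            pq.push(i);
            locker.unlock();
            cond.notify_one();
        }
    }
    {
        // Set the flag under the lock, then notify, so a consumer
        // waiting on an empty queue wakes up and can exit.
        lock_guard<mutex> locker(mu);
        fin = true;
    }
    cond.notify_one();
}

void popAndProcess() {
    while (true) {
        unique_lock<mutex> locker(mu);
        // Wake on either "new data" or "producer is done".
        cond.wait(locker, [&]{ return !pq.empty() || fin; });
        if (pq.empty()) break; // fin && empty: nothing left to do
        long long cur = pq.top();
        pq.pop();
        locker.unlock();
        if (isPrime(cur)) ans = max(ans, cur);
    }
}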

An odd use of a condition variable with a local mutex

Poring through the legacy code of an old and large project, I found that a thread-safe queue had been built in a somewhat odd way, something like this:
template <typename _Msg>
class WaitQue : public QWaitCondition
{
public:
    typedef _Msg DataType;

    void wakeOne(const DataType& msg)
    {
        QMutexLocker lock_(&mx);
        que.push(msg);
        QWaitCondition::wakeOne();
    }

    void wait(DataType& msg)
    {
        /// wait if empty.
        {
            QMutex wx; // WHAT?
            QMutexLocker cvlock_(&wx);
            if (que.empty())
                QWaitCondition::wait(&wx);
        }
        {
            QMutexLocker _wlock(&mx);
            msg = que.front();
            que.pop();
        }
    }

    unsigned long size() {
        QMutexLocker lock_(&mx);
        return que.size();
    }

private:
    std::queue<DataType> que;
    QMutex mx;
};
wakeOne is used from threads as a kind of "posting" function, and wait is called from other threads and waits indefinitely until a message appears in the queue. In some cases the roles of the threads reverse at different stages, using separate queues.
Is it even legal to use a QMutex this way, by creating a local one? I sort of understand why someone might do that to dodge a deadlock while reading the size of que, but how does it even work? Is there a simpler and more idiomatic way to achieve this behavior?
It's legal to use a local mutex like that, but it normally makes no sense.
As you've worked out, in this case it's wrong. You should be using the member:
void wait(DataType& msg)
{
    QMutexLocker cvlock_(&mx);
    while (que.empty())
        QWaitCondition::wait(&mx);
    msg = que.front();
    que.pop();
}
Notice also that you must have while instead of if around the call to QWaitCondition::wait. This is partly because of (possible) spurious wakeups - the Qt docs aren't clear here. But more importantly, the wakeup and the subsequent reacquisition of the mutex are not one atomic operation, so you must recheck the queue for emptiness. It could be this last case where you were previously getting deadlocks/UB.
Consider the scenario of an empty queue and a caller (thread 1) entering QWaitCondition::wait. This thread blocks. Then thread 2 comes along, adds an item to the queue, and calls wakeOne. Thread 1 gets woken up and tries to reacquire the mutex. However, thread 3 comes along and, in your implementation of wait, takes the mutex before thread 1, sees the queue isn't empty, processes the single item and moves on, releasing the mutex. Then thread 1, which has been woken up, finally acquires the mutex, returns from QWaitCondition::wait, and tries to process... an empty queue. Yikes.

confusion with semaphore definitions

For semaphore implementations, what does "process" refer to? In the context of the producer/consumer problem, is the process the producer method or the consumer method? Or is it P() itself, when we are in P() and the value is less than 0?
P() {
    value = value - 1;
    if (value < 0) {
        add the calling process to this semaphore's list;
        block this process
    }
}
EXAMPLE
If Consumer runs first, before Producer produces its first item:
Consumer would decrement the full value -> full = -1,
and then, since the value is less than 0, it would add the calling process to this semaphore's list. But I'm not sure what "process" is.
And what does it mean to block this process? Does it mean that the entire consumer method halts and the producer method runs?
code:
#define N 100
typedef int semaphore;
semaphore full = 0;   // Initially, no items in the buffer
semaphore empty = N;  // Initially, N empty buffer slots
semaphore mutex = 1;  // No thread updating the buffer

void producer(void) {
    int item;
    while(TRUE) {
        item = produce_item();
        down(&empty);
        down(&mutex);
        insert_item(item);
        up(&mutex);
        up(&full);
    }
}

void consumer(void) {
    int item;
    while(TRUE) {
        down(&full);
        down(&mutex);
        item = remove_item();
        up(&mutex);
        up(&empty);
        consume_item(item);
    }
}
A process, in this usage, is exactly like a thread. Usually when 'multiprocess' is used instead of 'multithreaded', it implies that the kernel handles the threading, which allows the computer to take advantage of multiple cores. However, that isn't important for this specific implementation, and it also isn't true of it, because nothing here is atomic.
Blocking the process here means that a process that calls P() and decrements the value to anything negative will halt its own execution when it reaches the 'block this process' step.
Assuming multithreading, your 'producer' will continually decrease the empty semaphore unless it would decrement it below zero, in which case it is halted, and only 'consumer' runs, at least until consumer increases the empty semaphore enough that 'producer' can run again. You can also swap both 'empty'<->'full' and 'producer'<->'consumer' in the previous two sentences, and they remain correct.
Also, I suggest you read up on semaphores elsewhere, because they are a basic part of threading/multiprocessing and other people have described them better than I ever could.
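To make "block this process" concrete, here is a minimal sketch of a counting semaphore built from a mutex and a condition variable (my own illustration, not from the question or answer). Note that unlike the textbook pseudocode above, value never goes negative here; the list of blocked processes is managed internally by the condition variable:

#include <condition_variable>
#include <mutex>

class Semaphore {
    std::mutex m;
    std::condition_variable cv;
    int value;
public:
    explicit Semaphore(int initial) : value(initial) {}

    void down() { // P(): blocks the calling thread while value == 0
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return value > 0; });
        --value;
    }

    void up() { // V(): wakes one blocked thread, if any
        std::lock_guard<std::mutex> lock(m);
        ++value;
        cv.notify_one();
    }
};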

How to make thread synchronization without using mutex, semaphore, spinlock and futex?

This is an interview question; the interview is over.
How can you achieve thread synchronization without using a mutex, semaphore, spinlock, or futex?
Given 5 threads, how do you make 4 of them wait at the same point for a signal from the fifth thread?
That is, when threads 1, 2, 3 and 4 reach a certain point in their thread functions, they stop and wait for a signal from thread 5; they will not proceed until thread 5 sends it.
My idea:
Use a global bool variable as a flag. Each of the other threads waits at one point, after first setting its own flag variable to true. Once thread 5 finds that all the threads' flag variables are true, it sets the global flag to true.
This is a busy-wait.
Any better ideas?
Thanks
the pseudo code:
bool globalflag = false;
bool a[10] = {false};

int main()
{
    for (int i = 0; i < 10; i++)
        pthread_create(threadfunc, i);
    while(1)
    {
        bool b = true;
        for (int i = 0; i < 10; i++)
            b = a[i] && b;
        if (b) break;
    }
    globalflag = true; // release the waiting threads
}

void threadfunc(i)
{
    a[i] = true;
    while(!globalflag)
        ; // busy-wait
}
Start with an empty linked list of waiting threads. The head should be set to 0.
Use CAS (compare-and-swap) to insert a thread at the head of the list of waiters. If the head == -1, then do not insert or wait. You can safely use CAS to insert items at the head of a linked list if you do it right.
After being inserted, the waiting thread should wait on SIGUSR1. Use sigwait() to do this.
When ready, the signaling thread uses CAS to set the head of the wait list to -1. This prevents any more threads from adding themselves to the wait list. Then the signaling thread iterates over the threads in the wait list and calls pthread_kill(thread, SIGUSR1) to wake up each waiting thread.
If SIGUSR1 is sent before a call to sigwait, sigwait will return immediately. Thus, there will not be a race between adding a thread to the wait list and calling sigwait.
EDIT:
Why is CAS faster than a mutex? A layman's answer (I'm a layman): it's faster for some things in some situations, because it has lower overhead when there is no contention. So if you can reduce your concurrency problem down to changing 8-16-32-64-128 bits of contiguous memory, and contention is rare, CAS wins. CAS is basically a slightly fancier/more expensive mov instruction, right where you were going to do a regular mov anyway; it's a "lock cmpxchg" or something like that.
A mutex, on the other hand, is a whole bunch of extra stuff that dirties other cache lines and uses more memory barriers, etc. (although CAS itself acts as a memory barrier on x86, x64, etc.). Then, of course, you have to unlock the mutex, which is probably about the same amount of extra stuff again.
Here is how you add an item to a linked list using CAS:
while (1)
{
    pOldHead = pHead;        /* snapshot of the world; start of the race */
    pItem->pNext = pOldHead; /* link the new item against that snapshot */
    if (CAS(&pHead, pOldHead, pItem)) /* end of the race, if pHead is still pOldHead */
        break;               /* success */
}
So how often do you think your code is going to have multiple threads at that CAS line at the exact same time? In reality....not very often. We did tests that just looped adding millions of items with multiple threads at the same time and it happens way less than 1% of the time. In a real program, it might never happen.
Obviously if there is a race you have to go back and do that loop again, but in the case of a linked list, what does that cost you?
The downside is that you can't do very complex things to that linked list if you are going to use this method to add items at the head. Try implementing a doubly linked list; what a pain.
EDIT:
In the code above I use a CAS macro. On Linux, CAS = a macro using __sync_bool_compare_and_swap (see the gcc atomic builtins). On Windows, CAS = a macro using something like InterlockedCompareExchange. Here is what the inline functions might look like on Windows:
inline bool CAS(volatile WORD* p, const WORD nOld, const WORD nNew) {
    return InterlockedCompareExchange16((short*)p, nNew, nOld) == nOld;
}
inline bool CAS(volatile DWORD* p, const DWORD nOld, const DWORD nNew) {
    return InterlockedCompareExchange((long*)p, nNew, nOld) == nOld;
}
inline bool CAS(volatile QWORD* p, const QWORD nOld, const QWORD nNew) {
    return InterlockedCompareExchange64((LONGLONG*)p, nNew, nOld) == nOld;
}
inline bool CAS(void*volatile* p, const void* pOld, const void* pNew) {
    return InterlockedCompareExchangePointer(p, (PVOID)pNew, (PVOID)pOld) == pOld;
}
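For completeness, the Linux/GCC counterpart can be as trivial as a thin wrapper over the builtin named above (a sketch, not from the original answer):

/* Sketch: CAS via the GCC/Clang atomic builtin; returns true on success. */
#define CAS(ptr, oldval, newval) \
    __sync_bool_compare_and_swap((ptr), (oldval), (newval))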
1. Choose a signal to use, say SIGUSR1.
2. Use pthread_sigmask to block SIGUSR1.
3. Create the threads (they inherit the signal mask, hence the mask must be set first!).
4. Threads 1-4 call sigwait, blocking until SIGUSR1 is received.
5. Thread 5 calls kill() or pthread_kill four times with SIGUSR1. Since POSIX specifies that signals will be delivered to a thread which is not blocking the signal, it will be delivered to one of the threads waiting in sigwait(). There is thus no need to keep track of which threads have already received the signal and which haven't, with the associated synchronization.
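A sketch of that recipe with four waiters (error handling omitted; the sleep() is a crude stand-in for whatever makes thread 5 decide the others may proceed):

#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void *waiter(void *arg) {
    sigset_t set;
    int sig;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    sigwait(&set, &sig);                    /* blocks until SIGUSR1 arrives */
    printf("thread %ld released\n", (long)arg);
    return NULL;
}

int main(void) {
    sigset_t set;
    pthread_t t[4];
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &set, NULL); /* block BEFORE creating threads */
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, waiter, (void *)i);
    sleep(1);
    for (int i = 0; i < 4; i++)
        pthread_kill(t[i], SIGUSR1);        /* one wakeup per waiting thread */
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}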
You can do this using the MONITOR and MWAIT instructions (introduced with SSE3), available via the _mm_monitor and _mm_mwait intrinsics; Intel has an article on it here.
(there is also a patent for using memory-monitor-wait for lock contention here that may be of interest).
I think you are looking for Peterson's algorithm or Dekker's algorithm.
They synchronize threads based only on shared memory.
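For reference, here is a minimal sketch of Peterson's algorithm for two threads (my own illustration; the names are made up). The C++11 atomics, with their default sequentially consistent ordering, are what keep this correct on real hardware; with plain variables the compiler and CPU could reorder the stores and break it:

#include <atomic>

std::atomic<bool> wants[2] = {{false}, {false}};
std::atomic<int> turn{0};

void lock(int self) {          // self is 0 or 1
    int other = 1 - self;
    wants[self].store(true);   // announce intent to enter
    turn.store(other);         // politely offer the other thread the turn
    while (wants[other].load() && turn.load() == other) {
        // busy-wait: no mutex, semaphore, spinlock object, or futex involved
    }
}

void unlock(int self) {
    wants[self].store(false);
}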

C++ multithreading, simple consumer / producer threads, LIFO, notification, counter

I am new to multi-threaded programming, and I want to implement the following functionality.
There are 2 threads, producer and consumer.
Consumer only processes the latest value, i.e., last in first out (LIFO).
Producer sometimes generates new values at a faster rate than consumer can process them. For example, producer may generate 2 new values in 1 millisecond, but it takes consumer approximately 5 milliseconds to process one.
If consumer receives a new value in the middle of processing an old value, there is no need to interrupt it. In other words, consumer will finish its current execution first, then start an execution on the latest value.
Here is my design process, please correct me if I am wrong.
There is no need for a queue, since only the latest value is
processed by consumer.
Are notifications sent from producer queued automatically?
I will use a counter instead.
ConsumerThread() checks the counter at the end, to make sure producer hasn't generated a new value in the meantime.
But what happens if producer generates a new value just before consumer goes to sleep(), but after it has checked the counter?
Here is some pseudo code.
boost::mutex mutex;
double x;
int counter = 0;

void ProducerThread()
{
    {
        boost::scoped_lock lock(mutex);
        x = rand();
        counter++;
    }
    notify(); // wake up consumer thread
}

void ConsumerThread()
{
    counter = 0; // reset counter, only process the latest value
    ... do something which takes 5 milliseconds ...
    if (counter > 0)
    {
        ... execute this function again, not too sure how to implement this ...
    }
    else
    {
        ... what happens if producer generates a new value here??? ...
        sleep();
    }
}
Thanks.
If I understood your question correctly, for your particular application, the consumer only needs to process the latest available value provided by the producer. In other words, it's acceptable for values to get dropped because the consumer cannot keep up with the producer.
If that's the case, then I agree that you can get away without a queue and use a counter. However, the shared counter and value variables will need to be accessed atomically.
You can use boost::condition_variable to signal notifications to the consumer that a new value is ready. Here is a complete example; I'll let the comments do the explaining.
#include <boost/thread/thread.hpp>
#include <boost/thread/mutex.hpp>
#include <boost/thread/condition_variable.hpp>
#include <boost/thread/locks.hpp>
#include <boost/date_time/posix_time/posix_time_types.hpp>
#include <cstdlib>
#include <iostream>

boost::mutex mutex;
boost::condition_variable condvar;

typedef boost::unique_lock<boost::mutex> LockType;

// Variables that are shared between producer and consumer.
double value = 0;
int count = 0;

void producer()
{
    while (true)
    {
        {
            // value and count must both be updated atomically,
            // using a mutex lock.
            LockType lock(mutex);
            value = std::rand();
            ++count;
            // Notify the consumer that a new value is ready.
            condvar.notify_one();
        }
        // Simulate the 2 ms production delay (exaggerated to 200 ms).
        boost::this_thread::sleep(boost::posix_time::milliseconds(200));
    }
}

void consumer()
{
    // Local copies of the 'count' and 'value' variables. We want to do the
    // work using local copies so that they don't get clobbered by
    // the producer when it updates.
    int currentCount = 0;
    double currentValue = 0;
    while (true)
    {
        {
            // Acquire the mutex before accessing 'count' and 'value'.
            LockType lock(mutex); // mutex is locked while in this scope
            while (count == currentCount)
            {
                // Wait for the producer to signal that there is a new value.
                // While we are waiting, Boost releases the mutex so that
                // other threads may acquire it.
                condvar.wait(lock);
            }
            // `lock` is automatically re-acquired when we come out of
            // condvar.wait(lock), so it's safe to access the shared
            // variables at this point.
            currentCount = count;  // Remember which value we have seen.
            currentValue = value;  // Grab a copy of the latest value
                                   // while we hold the lock.
        }
        // Now that we are out of the mutex lock scope, we work with our
        // local copy of `value`. The producer can keep on clobbering the
        // 'value' variable all it wants, but it won't affect us here,
        // because we are now using `currentValue`.
        std::cout << "value = " << currentValue << "\n";
        // Simulate the 5 ms processing delay (exaggerated to 500 ms).
        boost::this_thread::sleep(boost::posix_time::milliseconds(500));
    }
}

int main()
{
    boost::thread c(&consumer);
    boost::thread p(&producer);
    c.join();
    p.join();
}
ADDENDUM
I was thinking about this question recently, and realized that this solution, while it may work, is not optimal. Your producer is using all that CPU just to throw away half of the computed values.
I suggest that you reconsider your design and go with a bounded blocking queue between the producer and consumer. Such a queue should have the following characteristics:
Thread-safe
The queue has a fixed size (bounded)
If the consumer wants to pop the next item, but the queue is empty, the operation will be blocked until notified by the producer that an item is available.
The producer can check if there's room to push another item and block until the space becomes available.
With this type of queue, you can effectively throttle down the producer so that it doesn't outpace the consumer. It also ensures that the producer doesn't waste CPU resources computing values that will be thrown away.
Libraries such as TBB and PPL provide implementations of concurrent queues. If you want to attempt to roll your own using std::queue (or boost::circular_buffer) and boost::condition_variable, check out this blogger's example.
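As a rough illustration of what such a queue can look like, here is a minimal sketch built from std::queue and two condition variables, one for "not empty" and one for "not full" (the class and member names are made up, and this is far less polished than what TBB or PPL provide):

#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class BoundedQueue {
    std::queue<T> q;
    std::mutex m;
    std::condition_variable not_empty, not_full;
    const std::size_t capacity;
public:
    explicit BoundedQueue(std::size_t cap) : capacity(cap) {}

    void push(T value) {
        std::unique_lock<std::mutex> lock(m);
        // Throttle the producer: block while the queue is full.
        not_full.wait(lock, [this] { return q.size() < capacity; });
        q.push(std::move(value));
        not_empty.notify_one(); // wake a blocked consumer, if any
    }

    T pop() {
        std::unique_lock<std::mutex> lock(m);
        // Block while the queue is empty.
        not_empty.wait(lock, [this] { return !q.empty(); });
        T value = std::move(q.front());
        q.pop();
        not_full.notify_one(); // wake a blocked producer, if any
        return value;
    }
};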
The short answer is that you're almost certainly wrong.
With a producer/consumer, you pretty much need a queue between the two threads. Otherwise there are basically two alternatives: either your code will simply lose tasks (which usually equals not working at all), or your producer thread will need to block until the consumer thread is idle before it can produce an item, which effectively translates to single threading.
For the moment, I'm going to assume that the value you get back from rand is supposed to represent the task to be executed (i.e., is the value produced by the producer and consumed by the consumer). In that case, I'd write the code something like this:
void producer() {
    for (int i = 0; i < 100; i++)
        queue.insert(random());  // queue.insert blocks if the queue is full
    queue.insert(-1.0);          // tell the consumer to exit
}

void consumer() {
    double value;
    while ((value = queue.get()) != -1) // queue.get blocks if the queue is empty
        process(value);
}
This relegates nearly all the interlocking to the queue. The rest of the code for both threads pretty much ignores threading issues entirely.
Implementing a pipeline is actually quite tricky if you are doing it from the ground up. For example, you'd have to use a condition variable to avoid the kind of race condition you described in your question, avoid busy waiting when implementing the mechanism for "waking up" the consumer, etc. Even using a "queue" of just one element won't save you from some of these complexities.
It's usually much better to use specialized libraries that were developed and extensively tested specifically for this purpose. If you can live with a Visual C++ specific solution, take a look at the Parallel Patterns Library and its concept of Pipelines.