Implement a high-performance mutex similar to Qt's - C++

I have a multi-thread scientific application where several computing threads (one per core) have to store their results in a common buffer. This requires a mutex mechanism.
Working threads spend only a small fraction of their time writing to the buffer, so the mutex is unlocked most of the time and locks have a high probability of succeeding immediately, without waiting for another thread to unlock.
Currently I use Qt's QMutex for the task, and it works well: the mutex has negligible overhead.
However, I now have to port the code to C++11/STL only. With std::mutex, performance drops by 66% and the threads spend most of their time locking the mutex.
After another question, I figured out that Qt uses a fast locking mechanism based on a simple atomic flag, optimized for the case where the mutex is not already locked, and falls back to a system mutex when concurrent locking occurs.
I would like to implement this with the STL. Is there a simple way based on std::atomic and std::mutex? I have dug into Qt's code, but it seems overly complicated for my use case (I do not need lock timeouts, pimpl, a small footprint, etc.).
Edit: I have tried a spinlock, but this does not work well because:
Periodically (every few seconds), another thread locks the mutex and flushes the buffer. This takes some time, during which all worker threads are blocked. With spinlocks the waiting threads keep the cores busy, which makes the flush 10-100x slower than with a proper mutex. This is not acceptable.
Edit: I have also tried this, but it does not work (it ends up blocking all threads):
#include <atomic>
#include <condition_variable>
#include <mutex>

class Mutex
{
public:
    Mutex() : lockCounter(0) { }

    void lock()
    {
        if (lockCounter.fetch_add(1, std::memory_order_acquire) > 0)
        {
            std::unique_lock<std::mutex> lock(internalMutex);
            cv.wait(lock);
        }
    }

    void unlock()
    {
        if (lockCounter.fetch_sub(1, std::memory_order_release) > 1)
        {
            cv.notify_one();
        }
    }

private:
    std::atomic<int> lockCounter;
    std::mutex internalMutex;
    std::condition_variable cv;
};
Thanks!
Edit: Final solution
MikeMB's fast mutex was working pretty well.
As a final solution, I did:
Use a simple spinlock with a try_lock
When a thread fails to try_lock, instead of waiting, it pushes its result into a queue of its own (not shared with other threads) and continues
When a thread gets the lock, it updates the buffer with its current result and also with the results stored in its queue, i.e. it drains its own queue (a sketch of this pattern is shown below)
The buffer flushing was also made much more efficient: the blocking part only swaps two pointers.
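For illustration, a minimal sketch of this try_lock + thread-local queue pattern (Result, SpinMutex, sharedBuffer and publish are placeholder names, not the ones used in the real application):

#include <atomic>
#include <vector>

struct Result { double value; };

class SpinMutex {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    bool try_lock() { return !flag.test_and_set(std::memory_order_acquire); }
    void unlock()   { flag.clear(std::memory_order_release); }
};

SpinMutex bufferMutex;
std::vector<Result> sharedBuffer;             // protected by bufferMutex

// Called by each worker; localQueue belongs to that worker only.
void publish(Result r, std::vector<Result>& localQueue)
{
    if (!bufferMutex.try_lock()) {
        localQueue.push_back(r);              // lock is busy: stash the result and keep computing
        return;
    }
    for (const Result& queued : localQueue)   // got the lock: flush the backlog first
        sharedBuffer.push_back(queued);
    localQueue.clear();
    sharedBuffer.push_back(r);                // then the current result
    bufferMutex.unlock();
}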

General Advice
As was mentioned in some comments, I'd first have a look at whether you can restructure your program design to make the mutex implementation less critical for your performance.
Also, as multithreading support in standard C++ is pretty new and somewhat immature, you sometimes just have to fall back on platform-specific mechanisms, such as a futex on Linux systems or critical sections on Windows, or on non-standard libraries like Qt.
That being said, I could think of two implementation approaches that might potentially speed up your program:
Spinlock
If access collisions happen very rarely, and the mutex is only held for short periods of time (two things one should strive to achieve anyway, of course), it might be most efficient to just use a spinlock, as it doesn't require any system calls at all and it's simple to implement (taken from cppreference):
#include <atomic>
#include <thread>

class SpinLock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (locked.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield(); //<- this is not in the source but might improve performance.
        }
    }
    void unlock() {
        locked.clear(std::memory_order_release);
    }
};
The drawback, of course, is that waiting threads don't go to sleep and keep stealing processing time.
Checked Locking
This is essentially the idea you demonstrated: you first make a fast check of whether locking is actually needed, based on an atomic swap operation, and use a heavy std::mutex only if it is unavoidable.
#include <atomic>
#include <condition_variable>
#include <mutex>

struct FastMux {
    // Status of the fast mutex
    std::atomic<bool> locked;
    // helper mutex and cv on which threads can wait in case of collision
    std::mutex mux;
    std::condition_variable cv;
    // the maximum number of threads that might be waiting on the cv (conservative estimation)
    std::atomic<int> cntr;

    FastMux() : locked(false), cntr(0) {}

    void lock() {
        if (locked.exchange(true)) {
            cntr++;
            {
                std::unique_lock<std::mutex> ul(mux);
                cv.wait(ul, [&] { return !locked.exchange(true); });
            }
            cntr--;
        }
    }
    void unlock() {
        locked = false;
        if (cntr > 0) {
            std::lock_guard<std::mutex> ul(mux);
            cv.notify_one();
        }
    }
};
Note that the std::mutex is not kept locked between lock() and unlock(); it is only used for handling the condition variable. This results in more lock/unlock calls if there is high contention on the mutex.
The problem with your implementation is that cv.notify_one(); can potentially be called between if(lockCounter.fetch_add(1, std::memory_order_acquire)>0) and cv.wait(lock);, so your thread might never wake up.
I didn't do any performance comparisons against a fixed version of your proposed implementation though, so you'll just have to see what works best for you.
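If it helps: since FastMux provides plain lock()/unlock() members, it satisfies the BasicLockable requirements and works with the standard RAII guards. A minimal usage sketch (the variable and function names are illustrative, not from your code):

#include <mutex>

FastMux bufferMux;                              // the FastMux from above

void writeResult(/* ...the result... */)
{
    std::lock_guard<FastMux> guard(bufferMux);  // fast path costs one atomic exchange
    // ...append the result to the shared buffer...
}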

Not really an answer by definition, but depending on the specific task, a lock-free queue might help get rid of the mutex altogether. This would help the design if you have multiple producers and a single consumer (or even multiple consumers). Links:
Though not directly C++/STL, Boost.Lockfree provides such a queue.
Another option is the lock-free queue implementation in "C++ Concurrency in Action" by Anthony Williams.
A Fast Lock-Free Queue for C++
Update with respect to the comments:
Queue size / overflow:
Queue overflow can be avoided by (i) making the queue large enough or (ii) making the producer thread wait with pushing data once the queue is full.
Another option would be to use multiple consumers and multiple queues and implement a parallel reduction, but this depends on how the data is treated.
Consumer thread:
The queue could use std::condition_variable and make the consumer thread wait until there is data.
Another option would be to use a timer to check at regular intervals (polling) whether the queue is non-empty; once it is non-empty, the thread can continuously fetch data and then go back into wait mode.
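For illustration, a minimal sketch of the above using Boost.Lockfree's single-producer/single-consumer queue (element type, capacity and timings are arbitrary; with several producers you would give each worker its own queue, as mentioned above):

#include <boost/lockfree/spsc_queue.hpp>
#include <chrono>
#include <thread>

boost::lockfree::spsc_queue<int, boost::lockfree::capacity<1024>> queue;

void producer()
{
    for (int i = 0; i < 100000; ++i) {
        while (!queue.push(i))          // queue full: wait instead of overflowing
            std::this_thread::yield();
    }
}

void consumer()
{
    int value;
    for (;;) {
        while (queue.pop(value)) {
            // ...process value...
        }
        // queue empty: poll again after a short interval
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}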

Related

Choice of semaphores, mutexes and condition variables

I have a producer thread and a consumer thread, with the producer being real-time and determinism-sensitive.
Hence I decided to hoist the processing out of the producer thread into a consumer thread, using a lock-free FIFO queue. The goal is to have the consumer be responsive while avoiding busy-waiting, and to never delay the producer for an indeterminate amount of time; thus any allocations/locks (and kernel entries, I suppose?) etc. are completely out of the question.
I've implemented this pattern, which seems to work well; however, I'm unsure of why a mutex is needed at all:
std::mutex m;
std::condition_variable cv;

void consumer()
{
    std::unique_lock<std::mutex> lock(m);
    while (1)
    {
        cv.wait(lock);
        // process consumation...
    }
}

void producer()
{
    while (1)
    {
        // produce and post..
        cv.notify_one();
    }
}
The other canonical examples seem to lock the mutex in the producer as well; why? My data communication is already thread-safe, so this shouldn't be needed. Also, is this susceptible to missed signals?
While researching this, I stumbled upon semaphores, which seem to be made exactly for this situation. What are their benefits versus this system? I currently prefer my solution, just because it is part of the standard library.
Semaphores and condition variables are somewhat similar concepts. Classical counting semaphores, at least, aren't available natively in the current C++ standard library, but they can easily be replaced with a std::condition_variable controlling an incremented/decremented integer value.
The std::mutex for the condition variable is necessary to protect against race conditions when changing the underlying value.
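For example, a minimal counting semaphore built that way might look like this (a sketch, not a standard class; the names are illustrative):

#include <condition_variable>
#include <mutex>

class Semaphore {
    std::mutex m;
    std::condition_variable cv;
    int count;
public:
    explicit Semaphore(int initial = 0) : count(initial) {}

    void post() {                         // "signal": increment and wake one waiter
        std::lock_guard<std::mutex> lock(m);
        ++count;
        cv.notify_one();
    }

    void wait() {                         // block while the count is zero, then decrement
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return count > 0; });
        --count;
    }
};

The producer would call post() after publishing each item and the consumer would call wait() before consuming one; the internal mutex is exactly the mutex the question asks about, protecting the counter against races.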

Suitable thread "fence" for a worker thread

I have a Worker class that runs its own thread to do some work in parallel. During specific intervals I want it to be idle. I have this interface:
#include <mutex>

class Worker
{
    std::mutex m_wait;

    void pause() {
        m_wait.lock();
    }
    void resume() {   // 'continue' is a C++ keyword, so the method is named resume() here
        m_wait.unlock();
    }

    static void doWork(std::mutex& lock) {
        while (true) {
            {
                std::lock_guard<std::mutex> _l(lock);
                // Lock and immediately unlock again
            }
            // Only *some* work
        }
    }
};
As you can see, the doWork method checks once in a while whether the mutex is locked (by pause()) and doesn't continue if it is.
I have some questions about this implementation regarding speed and alternatives:
How much overhead is there from checking the lock? Suppose the real work is a 10x10 by 10x10 matrix multiplication (small datasets).
Is there an alternative, that achieves the same effect but has less overhead?
The cost of locking/unlocking a mutex depends on your platform and the specific mutex implementation. In the best case, locking/unlocking a mutex with no contention is equivalent to two interlocked memory operations.
In case of contention, locking becomes a costly operation because most implementations will make a kernel call to wait for the mutex to unlock. However, in that case you probably do not care, as the whole point is to stop the thread in the first place.
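If you want a ballpark number for your own platform, one quick (admittedly crude) way is to time a loop of uncontended lock/unlock pairs and compare it with the time of your 10x10 matrix multiplication; this sketch just illustrates the idea:

#include <chrono>
#include <cstdio>
#include <mutex>

int main()
{
    std::mutex m;
    const int iterations = 1000000;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        m.lock();     // uncontended: no other thread ever touches m
        m.unlock();
    }
    auto stop = std::chrono::steady_clock::now();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    std::printf("%.1f ns per lock/unlock pair\n", double(ns) / iterations);
}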

How to Create Thread-Safe Buffers / POD?

My problem is quite common I suppose, but it drives me crazy:
I have a multi-threaded application with 5 threads. 4 of these threads do their job, like network communication and local file system access, and then all write their output to a data structure of this form:
struct Buffer {
    std::vector<std::string> lines;
    bool has_been_modified;
};
The 5th thread prints these buffer/structures to the screen:
Buffer buf1, buf2, buf3, buf4;
...
if ( buf1.has_been_modified ||
     buf2.has_been_modified ||
     buf3.has_been_modified ||
     buf4.has_been_modified )
{
    redraw_screen_from_buffers();
}
How do I protect the buffers from being overwritten while they are either being read from or written to?
I can't find a proper solution, although I think this has to be a quite common problem.
Thanks.
You should use a mutex. The mutex class is std::mutex. With C++11 you can use std::lock_guard<std::mutex> to encapsulate the mutex using RAII. So you would change your Buffer struct to
struct Buffer {
    std::vector<std::string> lines;
    bool has_been_modified;
    std::mutex mutex;
};
and whenever you read from or write to the buffer or has_been_modified you would do
std::lock_guard<std::mutex> lockGuard(buf1.mutex); // do this for each buffer you want to access
... // access the buffer here
and the mutex will be automatically released by the lock_guard when it is destroyed.
You can read more about mutexes here.
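For illustration, here is a sketch (not from the original answer) of how the printing thread could check and reset the flags using those per-buffer mutexes; checkAndClear is an illustrative helper name, and buf1..buf4 and redraw_screen_from_buffers() are the ones from the question:

bool checkAndClear(Buffer& buf)
{
    std::lock_guard<std::mutex> lock(buf.mutex);
    bool modified = buf.has_been_modified;
    buf.has_been_modified = false;
    return modified;
}

void printerThread()
{
    bool needRedraw = checkAndClear(buf1);
    needRedraw |= checkAndClear(buf2);
    needRedraw |= checkAndClear(buf3);
    needRedraw |= checkAndClear(buf4);
    if (needRedraw)
        redraw_screen_from_buffers();   // must itself lock each buffer while reading lines
}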
You can use a mutex (or mutexes) around the buffers to ensure that they're not modified by multiple threads at the same time.
// Mutex shared between the multiple threads
std::mutex g_BufferMutex;

void redraw_screen_from_buffers()
{
    std::lock_guard<std::mutex> bufferLockGuard(g_BufferMutex);
    //redraw here after mutex has been locked.
}
Then your buffer modification code would have to lock the same mutex when the buffers are being modified.
void updateBuffer()
{
    std::lock_guard<std::mutex> bufferLockGuard(g_BufferMutex);
    // update here after mutex has been locked
}
This contains some mutex examples.
What it appears you want to accomplish is to have multiple threads/workers and one observer. The latter needs to do its job only when all workers are done and signal it. If this is the case, then check the code in this SO Q&A: std::condition_variable - Wait for several threads to notify observer
Mutexes are a very nice thing when trying to avoid data races, and I'm sure the answer posted by @Phantom will satisfy most people. However, one should know that this is not scalable to large systems.
By locking you are synchronising your threads. As only one at a time can access the vector, one thread writing to the container will cause the others to wait for it to finish... which may be fine for you, but it causes a serious performance bottleneck when high performance is needed.
The best solution would be to use a more complex lock-free structure. Unfortunately, I don't think there is any standard lock-free structure in the STL. One example of a lock-free queue is available here.
Using such a structure, your 4 working threads would be able to enqueue messages to the container while the 5th one dequeues them, without any data races.
More on lock-free data structures can be found here!

Is there a facility in boost to allow for write-biased locking?

If I have the following code:
#include <boost/date_time.hpp>
#include <boost/thread.hpp>

boost::shared_mutex g_sharedMutex;

void reader()
{
    boost::shared_lock<boost::shared_mutex> lock(g_sharedMutex);
    boost::this_thread::sleep(boost::posix_time::seconds(10));
}

void makeReaders()
{
    while (1)
    {
        boost::thread ar(reader);
        boost::this_thread::sleep(boost::posix_time::seconds(3));
    }
}

boost::thread mr(makeReaders);
boost::this_thread::sleep(boost::posix_time::seconds(5));
boost::unique_lock<boost::shared_mutex> lock(g_sharedMutex);
...
the unique lock will never be acquired, because there are always going to be readers. I want a unique_lock that, when it starts waiting, prevents any new read locks from gaining access to the mutex (called a write-biased or write-preferred lock, based on my wiki searching). Is there a simple way to do this with boost? Or would I need to write my own?
Note that I won't comment on the Win32 implementation because it's way more involved and I don't have the time to go through it in detail. That being said, its interface is the same as the pthread implementation's, which means that the following answer should be equally valid.
The relevant pieces of the pthread implementation of boost::shared_mutex as of v1.51.0:
void lock_shared()
{
    boost::this_thread::disable_interruption do_not_disturb;
    boost::mutex::scoped_lock lk(state_change);
    while(state.exclusive || state.exclusive_waiting_blocked)
    {
        shared_cond.wait(lk);
    }
    ++state.shared_count;
}

void lock()
{
    boost::this_thread::disable_interruption do_not_disturb;
    boost::mutex::scoped_lock lk(state_change);
    while(state.shared_count || state.exclusive)
    {
        state.exclusive_waiting_blocked=true;
        exclusive_cond.wait(lk);
    }
    state.exclusive=true;
}
The while loop conditions are the most relevant part for you. For the lock_shared function (read lock), notice how the while loop will not terminate as long as there's a thread trying to acquire (state.exclusive_waiting_blocked) or already owns (state.exclusive) the lock. This essentially means that write locks have priority over read locks.
For the lock function (write lock), the while loop will not terminate as long as there's at least one thread that currently owns the read lock (state.shared_count) or another thread owns the write lock (state.exclusive). This essentially gives you the usual mutual exclusion guarantees.
As for deadlocks, well the read lock will always return as long as the write locks are guaranteed to be unlocked once they are acquired. As for the write lock, it's guaranteed to return as long as the read locks and the write locks are always guaranteed to be unlocked once acquired.
In case you're wondering, the state_change mutex is used to ensure that there's no concurrent calls to either of these functions. I'm not going to go through the unlock functions because they're a bit more involved. Feel free to look them over yourself, you have the source after all (boost/thread/pthread/shared_mutex.hpp) :)
All in all, this is pretty much a textbook implementation and it has been extensively tested in a wide range of scenarios (libs/thread/test/test_shared_mutex.cpp and massive use across the industry). I wouldn't worry too much as long as you use it idiomatically (no recursive locking, and always lock using the RAII helpers). If you still don't trust the implementation, then you could write a randomized test that simulates whatever case you're worried about and let it run overnight on hundreds of threads. That's usually a good way to tease out deadlocks.
Now why would you see that a read lock is acquired after a write lock is requested? Difficult to say without seeing the diagnostic code that you're using. Chances are that the read lock is acquired after your print statement (or whatever you're using) is completed and before state_change lock is acquired in the write thread.

pthreads: thread starvation caused by quick re-locking

I have two threads: one works in a tight loop, and the other occasionally needs to synchronize with the first:
// thread 1
while(1)
{
    lock(work);
    // perform work
    unlock(work);
}

// thread 2
while(1)
{
    // unrelated work that takes a while
    lock(work);
    // synchronizing step
    unlock(work);
}
My intention is that thread 2 can, by taking the lock, effectively pause thread 1 and perform the necessary synchronization. Thread 1 can also offer to pause, by unlocking, and if thread 2 is not waiting on lock, re-lock and return to work.
The problem I have encountered is that mutexes are not fair, so thread 1 quickly re-locks the mutex and starves thread 2. I have attempted to use pthread_yield, and so far it seems to run okay, but I am not sure it will work for all systems / number of cores. Is there a way to guarantee that thread 1 will always yield to thread 2, even on multi-core systems?
What is the most effective way of handling this synchronization process?
You can build a FIFO "ticket lock" on top of pthreads mutexes, along these lines:
#include <pthread.h>

typedef struct ticket_lock {
    pthread_cond_t cond;
    pthread_mutex_t mutex;
    unsigned long queue_head, queue_tail;
} ticket_lock_t;

#define TICKET_LOCK_INITIALIZER { PTHREAD_COND_INITIALIZER, PTHREAD_MUTEX_INITIALIZER }

void ticket_lock(ticket_lock_t *ticket)
{
    unsigned long queue_me;

    pthread_mutex_lock(&ticket->mutex);
    queue_me = ticket->queue_tail++;
    while (queue_me != ticket->queue_head)
    {
        pthread_cond_wait(&ticket->cond, &ticket->mutex);
    }
    pthread_mutex_unlock(&ticket->mutex);
}

void ticket_unlock(ticket_lock_t *ticket)
{
    pthread_mutex_lock(&ticket->mutex);
    ticket->queue_head++;
    pthread_cond_broadcast(&ticket->cond);
    pthread_mutex_unlock(&ticket->mutex);
}
Under this kind of scheme, no low-level pthreads mutex is held while a thread is inside the ticket-lock-protected critical section, allowing other threads to join the queue.
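For illustration, usage for the two threads from the question might look like this (a sketch; the thread function names are made up):

ticket_lock_t work = TICKET_LOCK_INITIALIZER;

void *tight_loop_thread(void *arg)      /* thread 1 */
{
    while (1) {
        ticket_lock(&work);
        /* perform work */
        ticket_unlock(&work);           /* if thread 2 queued meanwhile, it goes next */
    }
    return NULL;
}

void *sync_thread(void *arg)            /* thread 2 */
{
    while (1) {
        /* unrelated work that takes a while */
        ticket_lock(&work);             /* FIFO tickets: thread 1 cannot overtake and starve us */
        /* synchronizing step */
        ticket_unlock(&work);
    }
    return NULL;
}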
In your case it is better to use a condition variable to notify the second thread when it needs to wake up and perform the required operations.
pthread offers a notion of thread priority in its API. When two threads are competing over a mutex, the scheduling policy determines which one will get it. The function pthread_attr_setschedpolicy lets you set that, and pthread_attr_getschedpolicy permits retrieving the information.
Now the bad news:
When only two threads are locking/unlocking a mutex, I can't see any real competition: the first one to run the atomic instruction takes it, and the other blocks. I am not sure whether this attribute applies here.
The function can take different parameters (SCHED_FIFO, SCHED_RR, SCHED_OTHER and SCHED_SPORADIC), but in this question it has been answered that only SCHED_OTHER is supported on Linux.
So I would give it a shot if I were you, but don't expect too much. pthread_yield seems more promising to me. More information is available here.
The ticket lock above looks like the best option. However, to ensure your pthread_yield approach works, you could have a bool waiting flag which is set and reset by thread 2; thread 1 yields as long as the flag is set (see the sketch below).
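A sketch of that idea (using std::atomic<bool> so the flag is safely visible across threads, and std::this_thread::yield() as a portable stand-in for pthread_yield; the mutex and function names are illustrative):

#include <atomic>
#include <mutex>
#include <thread>

std::mutex work;
std::atomic<bool> waiting(false);

void thread2_sync()                         // occasional synchronization
{
    waiting.store(true);                    // tell thread 1 to back off
    std::lock_guard<std::mutex> lock(work);
    waiting.store(false);                   // we hold the lock now
    // synchronizing step
}

void thread1_loop()                         // tight work loop
{
    while (true) {
        while (waiting.load())              // yield instead of immediately re-locking
            std::this_thread::yield();
        std::lock_guard<std::mutex> lock(work);
        // perform work
    }
}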
Here's a simple solution which will work for your case (two threads). If you're using std::mutex then this class is a drop-in replacement. Change your mutex to this type and you are guaranteed that if one thread holds the lock and the other is waiting on it, once the first thread unlocks, the second thread will grab the lock before the first thread can lock it again.
If more than two threads happen to use the mutex simultaneously it will still function but there are no guarantees on fairness.
If you're using plain pthread_mutex_t you can easily change your locking code according to this example (unlock remains unchanged).
#include <mutex>

// Behaves the same as std::mutex but guarantees fairness as long as
// up to two threads are using (holding/waiting on) it.
// When one thread unlocks the mutex while another is waiting on it,
// the other is guaranteed to run before the first thread can lock it again.
class FairDualMutex : public std::mutex {
public:
    void lock() {
        _fairness_mutex.lock();
        std::mutex::lock();
        _fairness_mutex.unlock();
    }
private:
    std::mutex _fairness_mutex;
};
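For the plain pthread_mutex_t variant mentioned above, the changed locking code might look like this sketch (the mutex and function names are illustrative; as stated, unlock remains unchanged):

#include <pthread.h>

pthread_mutex_t work_mutex     = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t fairness_mutex = PTHREAD_MUTEX_INITIALIZER;

void fair_lock(void)
{
    /* If another thread is waiting, it holds fairness_mutex while blocked on
       work_mutex, so a thread that just released work_mutex must wait here
       and the waiter is guaranteed to acquire work_mutex first. */
    pthread_mutex_lock(&fairness_mutex);
    pthread_mutex_lock(&work_mutex);
    pthread_mutex_unlock(&fairness_mutex);
}

void fair_unlock(void)
{
    pthread_mutex_unlock(&work_mutex);   /* unchanged */
}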