I have two processes, Writer and Reader, running on the same machine. Writer is a single thread and writes data to a shared memory region. Reader has 8 threads that intend to read data from the shared memory concurrently. I need a locking mechanism that meets the following criteria:
1) At any given time, either Writer or Reader (but not both) is allowed to access the shared memory.
2) If Reader has permission to read data from the shared memory, all of its threads can read the data.
3) Writer has to wait until Reader "completely" releases the lock (because Reader has multiple threads).
I have read a lot about sharable mutexes, which seem to be the solution. Here is a more detailed description of my system:
1) The system should run on both Windows and Linux.
2) I divide the shared memory into two regions: locks and data. The data region is further divided into 100 blocks. I intend to create 100 "lock objects" (sharable mutexes) and lay them out in the locks region. These lock objects are used to synchronize the 100 data blocks, one lock object per data block.
3) Writer and the Reader threads first determine which block they would like to access and then try to acquire the appropriate lock. Once the lock is acquired, they operate on the data block.
My concern now is:
Is there any "built-in" way to lay the lock objects out in shared memory on Windows and Linux (CentOS) so that I can lock/unlock with them without using the Boost library?
[Edited Feb 25, 2016, 09:30 GMT]
I can suggest a few things. It really depends on the requirements.
If it seems like the Boost upgradable mutex fits the bill, then by all means use it. From a five-minute read it seems you can place them in shm. I have no experience with it, as I don't use Boost, but Boost is available on both Windows and Linux, so I don't see why you shouldn't use it. You can always grab the specific code you like and bring it into your project without dragging the entire behemoth along.
Anyway, isn't it fairly easy to test and see whether it's good enough?
I don't understand the requirement for placing the locks in shm. If it's not a hard requirement and you want to use OS-native objects, you can use a different mechanism per OS: say, a named mutex on Windows (not in shm) and a pthread_rwlock in shm on Linux.
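For the Linux side, here is a minimal sketch of a process-shared pthread_rwlock_t placed in a shared memory segment (the segment name "/my_locks", the single-lock layout and the bare-bones error handling are my own assumptions, not part of the question):

// Minimal sketch: one process-shared pthread_rwlock_t living in shm (Linux).
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <pthread.h>
#include <cstdio>

int main()
{
    int fd = shm_open("/my_locks", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, sizeof(pthread_rwlock_t)) != 0) { perror("shm"); return 1; }

    void* p = mmap(nullptr, sizeof(pthread_rwlock_t),
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    pthread_rwlock_t* rw = static_cast<pthread_rwlock_t*>(p);

    // Exactly one process should perform this initialization.
    pthread_rwlockattr_t attr;
    pthread_rwlockattr_init(&attr);
    pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_rwlock_init(rw, &attr);
    pthread_rwlockattr_destroy(&attr);

    pthread_rwlock_rdlock(rw);    // reader threads take the shared side
    // ... read the corresponding data block ...
    pthread_rwlock_unlock(rw);    // the writer would use pthread_rwlock_wrlock
    return 0;
}

In your layout you would place an array of 100 such rwlocks at the start of the segment, one per data block; compile with -pthread (and -lrt on older glibc).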
I know what I would prefer to use: a seqlock.
I work in the low-latency domain, so I'm picking what gets me the lowest possible latency. I measure it in CPU cycles.
Since you mention that you want a lock per block rather than one big lock, I assume performance is important.
There are important questions here, though:
Since it's in shm, I assume the data is POD (flat data)? If not, you can switch to a read/write spinlock.
Are you OK with spinning (busy-waiting), or do you want to sleep-wait? Seqlocks and spinlocks are not OS mechanisms, so there's nobody to put your waiting threads to sleep. If you do want to sleep-wait, read #4.
If you care to know whether the other side (reader/writer) died, you have to implement that in some other way. Again, because a seqlock is not an OS beast. If you want to be notified of the other side's death as part of the synchronization mechanism, you'll have to settle for named mutexes on Windows and robust mutexes, in shm, on Linux.
Spinlocks and seqlocks provide the maximum throughput and minimum latency. With kernel-supported synchronization, a big part of the latency is spent switching between user and kernel space. In most applications this is not a problem, as synchronization only happens a small fraction of the time, and the extra latency of a few microseconds is negligible. Even in games, 100 fps leaves you with 10 ms per frame, which is an eternity in terms of mutex lock/unlock.
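For reference, a minimal single-writer seqlock sketch over a POD block (the struct name, the 256-byte payload and the use of std::atomic are my illustration; strictly speaking the non-atomic payload access is a formal data race in the C++ memory model, which is the usual caveat with seqlocks):

// Illustrative single-writer seqlock. Readers retry if they observe a torn write.
#include <atomic>
#include <cstring>
#include <cstdint>
#include <cstddef>

struct SeqLockBlock
{
    std::atomic<uint32_t> seq{0};   // even = stable, odd = write in progress
    char data[256];                 // POD payload (size is an assumption)

    void write(const char* src, size_t n)            // single writer only
    {
        seq.fetch_add(1, std::memory_order_relaxed);          // now odd
        std::atomic_thread_fence(std::memory_order_release);
        std::memcpy(data, src, n);
        seq.fetch_add(1, std::memory_order_release);          // back to even
    }

    void read(char* dst, size_t n) const
    {
        uint32_t s0, s1;
        do {
            s0 = seq.load(std::memory_order_acquire);
            while (s0 & 1)                                     // writer active: wait
                s0 = seq.load(std::memory_order_acquire);
            std::memcpy(dst, data, n);
            std::atomic_thread_fence(std::memory_order_acquire);
            s1 = seq.load(std::memory_order_relaxed);
        } while (s0 != s1);                                    // a write overlapped: retry
    }
};

Readers never block the writer, which is exactly why the payload must be flat data that can be copied and thrown away on a retry.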
There are alternatives to spinlock that are usually not much more expensive.
In Windows, a Critical Section is actually a spinlock with a back-off mechanism that uses an Event object. This was re-implemented using shm and a named Event and called a Metered Section.
On Linux, the pthread mutex is futex-based. A futex is a bit like an Event on Windows. A non-robust mutex with no contention is essentially just a spinlock's user-space fast path.
These guys still don't provide you with notification when the other side dies.
Addition [Feb 26, 2016, 10:00 GMT]
How to add your own owner death detection:
The Windows named mutex and pthread robust mutex have this capability built-in. It's easy enough to add it yourself when using other lock types and could be essential when using user-space-based locks.
First, I have to say that in many scenarios it's more appropriate to simply restart everything instead of detecting the owner's death. It is definitely simpler, since with death detection you also have to release the lock from a process that is not the original owner.
Anyway, the native way to detect process death is easy on Windows: processes are waitable objects, so you can just wait on them. You can wait for zero time for an immediate check.
On Linux, only the parent is supposed to know about its child's death, so it's less trivial. The parent can get SIGCHLD or use waitpid().
My favorite way to detect process death is different. I connect a non-blocking TCP socket between the 2 processes and trust the OS to kill it on process death.
When you try to read data from the socket (on any of the sides) you'd read 0 bytes if the peer has died. If it's still alive, you'd get EWOULDBLOCK.
Obviously, this also works between boxes, so kinda convenient to have it uniformly done once and for all.
Your worker loop will have to change to interleave the peer-death check with its usual work.
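A hedged sketch of that check, assuming a connected non-blocking TCP socket between the two processes (the helper name peer_is_alive is mine):

// Illustrative peer-death check on a connected, non-blocking TCP socket.
#include <sys/socket.h>
#include <cerrno>

bool peer_is_alive(int fd)
{
    char byte;
    // MSG_PEEK so we don't consume real data if the protocol ever carries any.
    ssize_t n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
    if (n == 0)
        return false;                                   // orderly close: peer is gone
    if (n < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
        return true;                                    // still connected, nothing to read
    return n > 0;   // >0: data pending (alive); other errors (e.g. ECONNRESET) count as dead
}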
#include <boost/interprocess/sync/interprocess_mutex.hpp>
#include <boost/interprocess/sync/interprocess_condition.hpp>
//Mutex to protect access to the queue
boost::interprocess::interprocess_mutex mutex;
//Condition to wait when the queue is empty
boost::interprocess::interprocess_condition cond_empty;
//Condition to wait when the queue is full
boost::interprocess::interprocess_condition cond_full;
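These interprocess primitives can be constructed directly inside a Boost managed shared memory segment. A hedged usage sketch (the segment name "demo_segment", the struct name and the segment size are my own):

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/sync/interprocess_mutex.hpp>
#include <boost/interprocess/sync/interprocess_condition.hpp>
#include <boost/interprocess/sync/scoped_lock.hpp>

struct shared_sync
{
    boost::interprocess::interprocess_mutex     mutex;
    boost::interprocess::interprocess_condition cond_empty;
    boost::interprocess::interprocess_condition cond_full;
};

int main()
{
    using namespace boost::interprocess;
    // Every participating process opens (or creates) the same named segment.
    managed_shared_memory segment(open_or_create, "demo_segment", 65536);
    shared_sync* sync = segment.find_or_construct<shared_sync>("sync")();

    scoped_lock<interprocess_mutex> lock(sync->mutex);
    // ... touch the shared data, wait on sync->cond_empty, notify cond_full, etc. ...
    return 0;
}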
Related
I've used pthreads a fair bit for concurrent programs, mainly utilising spinlocks, mutexes, and condition variables.
I started looking into multithreading using std::thread and std::mutex, and I noticed that there doesn't seem to be an equivalent to the pthreads spinlock in the standard library.
Anyone know why this is?
there doesn't seem to be an equivalent to the pthreads spinlock in the standard library.
Spinlocks are often considered the wrong tool in user space because there is no way to disable thread preemption while the spinlock is held (unlike in the kernel). A thread can therefore acquire a spinlock and then get preempted, causing all other threads trying to acquire the spinlock to spin unnecessarily; if those threads are of higher priority, that may even cause a deadlock (threads waiting for I/O may get a priority boost on wake-up). This reasoning also applies to all lockless data structures, unless the data structure is truly wait-free (there aren't many practically useful ones, apart from boost::spsc_queue).
In kernel, a thread that has locked a spinlock cannot be preempted or interrupted before it releases the spinlock. And that is why spinlocks are appropriate there (when RCU cannot be used).
On Linux, one can prevent preemption (not sure if completely, but there have been recent kernel changes towards such a desirable effect) by using isolated CPU cores and FIFO real-time threads pinned to those isolated cores. But that requires a deliberate kernel/machine configuration and an application designed to take advantage of that configuration. Nevertheless, people do use such a setup for business-critical applications, along with lockless (but not wait-free) data structures in user space.
On Linux, there is adaptive mutex PTHREAD_MUTEX_ADAPTIVE_NP, which spins for a limited number of iterations before blocking in the kernel (similar to InitializeCriticalSectionAndSpinCount). However, that mutex cannot be used through std::mutex interface because there is no option to customise non-portable pthread_mutexattr_t before initialising pthread_mutex_t.
Nor can one enable process sharing, robustness, error checking, or priority-inversion protection through the std::mutex interface. In practice, people write their own wrappers around pthread_mutex_t that allow setting the desired mutex attributes, along with a corresponding wrapper for condition variables. Standard locks like std::unique_lock and std::lock_guard can be reused.
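A hedged sketch of such a wrapper (the class name and the particular attributes are my own choices; it satisfies BasicLockable, so std::lock_guard and std::unique_lock can be reused with it):

// Illustrative pthread_mutex_t wrapper with custom attributes.
#include <pthread.h>
#include <system_error>
#include <mutex>            // only for the std::lock_guard usage example

class attr_mutex
{
    pthread_mutex_t m_;
public:
    attr_mutex()
    {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        // Example attributes; pick whatever your use case needs.
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        pthread_mutex_init(&m_, &attr);
        pthread_mutexattr_destroy(&attr);
    }
    ~attr_mutex() { pthread_mutex_destroy(&m_); }
    attr_mutex(const attr_mutex&) = delete;
    attr_mutex& operator=(const attr_mutex&) = delete;

    void lock()
    {
        if (int rc = pthread_mutex_lock(&m_))
            throw std::system_error(rc, std::generic_category());
    }
    void unlock()   { pthread_mutex_unlock(&m_); }
    bool try_lock() { return pthread_mutex_trylock(&m_) == 0; }

    pthread_mutex_t* native_handle() { return &m_; }
};

// Usage: attr_mutex m;  std::lock_guard<attr_mutex> guard(m);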
IMO, there could be provisions in the std:: APIs to set the desired mutex and condition variable properties, such as a protected constructor for derived classes that would initialize the native_handle, but there aren't any. native_handle looks like a good hook for platform-specific stuff, but there would have to be a constructor allowing a derived class to initialize it appropriately. Once the mutex or condition variable has been initialized, that native_handle is pretty much useless, unless the idea was only to let you pass it to (C-language) APIs that expect a pointer or reference to an initialized pthread_mutex_t.
There is a similar example of Boost and the C++ standard not accepting semaphores, on the basis that they are too much rope to hang oneself with, and that a mutex (essentially a binary semaphore) and a condition variable are more fundamental and more flexible synchronisation primitives, out of which a semaphore can be built.
From the point of view of the C++ standard those are probably right decisions because educating users to use spinlocks and semaphores correctly with all the nuances is a difficult task. Whereas advanced users can whip out a wrapper for pthread_spinlock_t with little effort.
You are right that there's no spin lock implementation in the std namespace. A spin lock is a great concept, but in user space it generally performs quite poorly. The OS doesn't know your process wants to spin, and you can easily get worse results than with a mutex. Note that several platforms implement optimistic spinning, so a mutex can already do a really good job. In addition, adjusting the time to "pause" between loop iterations is neither trivial nor portable, and fine tuning is required. TL;DR: don't use a spinlock in user space unless you are really, really sure about what you are doing.
C++ Thread discussion
Article explaining how to write a spin lock with benchmark
Reply by Linus Torvalds about the above article explaining why it's a bad idea
Spin locks have two advantages:
They require much less storage than a std::mutex, because they do not need a queue of threads waiting for the lock. On my system, sizeof(pthread_spinlock_t) is 4, while sizeof(std::mutex) is 40.
They are much more performant than std::mutex, if the protected code region is small and the contention level is low to moderate.
On the downside, a poorly implemented spin lock can hog the CPU. For example, a tight loop around a compare-and-set assembler instruction will spam the cache system with loads and loads of unnecessary writes. But that's what libraries are for: they implement best practice and avoid common pitfalls. That most user implementations of spin locks are poor is not a reason to keep spin locks out of the library; rather, it is a reason to put them there, to stop users from trying it themselves.
There is a second problem that arises from the scheduler: if thread A acquires the lock and then gets preempted by the scheduler before it finishes executing the critical section, another thread B could spin "forever" (or at least for many milliseconds, before thread A gets scheduled again) on that lock.
Unfortunately, there is no way for userland code to tell the kernel "please don't preempt me in this critical code section". But if we know that, under normal circumstances, the critical code section executes within 10 ns, we could at least tell thread B: "preempt yourself voluntarily if you have been spinning for over 30 ns". This is not guaranteed to return control directly back to thread A, but it will stop the waste of CPU cycles that would otherwise take place. And in most scenarios where threads A and B run in the same process at the same priority, the scheduler will usually schedule thread A before thread B if B has called std::this_thread::yield().
So, I am thinking about a template spin lock class that takes a single unsigned integer as a parameter: the number of memory reads in the critical section. This parameter is then used by the library to calculate the appropriate number of spins before a yield() is performed. With a zero count, yield() would never be called.
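A hedged sketch of that idea (the class name, the spin-count heuristic and the use of std::atomic<bool> are my own choices, not an existing library facility):

// Illustrative spin lock that yields after a bounded number of spins.
// SpinsBeforeYield would be derived from the expected critical-section length;
// with a zero count, yield() is never called, matching the idea above.
#include <atomic>
#include <thread>

template <unsigned SpinsBeforeYield>
class spin_lock
{
    std::atomic<bool> locked_{false};
public:
    void lock()
    {
        unsigned spins = 0;
        for (;;) {
            // Test-and-test-and-set: spin on a plain load so we don't
            // hammer the cache line with writes while waiting.
            if (!locked_.load(std::memory_order_relaxed) &&
                !locked_.exchange(true, std::memory_order_acquire))
                return;
            if (SpinsBeforeYield != 0 && ++spins >= SpinsBeforeYield) {
                std::this_thread::yield();
                spins = 0;
            }
        }
    }
    bool try_lock() { return !locked_.exchange(true, std::memory_order_acquire); }
    void unlock()   { locked_.store(false, std::memory_order_release); }
};

// Usage: spin_lock<100> lk;  std::lock_guard<spin_lock<100>> guard(lk);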
POSIX provides a mechanism for a mutex to be marked as "robust", allowing multi-process systems to recover gracefully from the crash of a process holding a mutex.
pthread_mutexattr_setrobust(&mutexattr, PTHREAD_MUTEX_ROBUST);
http://man7.org/linux/man-pages/man3/pthread_mutexattr_setrobust.3.html
However, there doesn't seem to be an equivalent for rwlock (reader-writer locks).
How can a process gracefully recover from a process crashing while holding a rwlock?
Implementing a robust rwlock is actually quite difficult because of the "concurrent readers" property: a rwlock with bounded storage but an unbounded number of concurrent readers fundamentally cannot track who its readers are. So if knowledge of the current readers is to be kept (in order to decrement the current read-lock count when a reader dies), it must be the reader tasks themselves, not the rwlock, that are aware of their ownership of it. I don't see any obvious way it can be built on top of robust mutexes, or on top of the underlying mechanisms (like robust_list on Linux) typically used to implement robust mutexes.
If you really need robust rwlock semantics, you're probably better off having some sort of protocol with a dedicated coordinator process that's assumed not to die, that tracks death of clients via closure of a pipe/socket to them and is able to tell via shared memory contents whether the process that died held a read lock. Note that this still involves implementing your own sort of rwlock.
How can a process gracefully recover from a process crashing while holding a rwlock?
POSIX does not define a robustness option for its rwlocks. If a process dies while holding one locked, the lock is not recoverable -- you cannot even pthread_rwlock_destroy it. You also cannot expect any processes blocked trying to acquire the lock to unblock until their timeout, if any, expires. Threads blocked trying unconditionally and without a timeout to acquire the lock cannot be unblocked even by delivering a signal, because POSIX specifies that if their wait is interrupted by a signal, they will resume blocking after the signal handler finishes.
Therefore, in the case that one of several cooperating processes dies while holding a shared rwlock locked, whether for reading or writing, the best possible result is probably to arrange for the other cooperating processes to shut down as cleanly as possible. You should be able to arrange for something like that via a separate monitoring process that sends SIGTERMs to the others when a locking failure occurs. There would need to be much more to it than that, though, probably including some kind of extra discipline or wrapping around acquiring the rwlock.
Honestly, if you want robustness then you're probably better off rolling your own read/write lock using, for example, POSIX robust mutexes and POSIX condition variables, as #R.. described in comments.
Consider also that robustness of the lock itself is only half the picture. If a thread dies while holding the write lock, then you have the additional issue of ensuring the integrity of the data protected by the lock before continuing.
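For the robust-mutex building block mentioned above, a hedged sketch of initializing a process-shared robust mutex and handling EOWNERDEAD with pthread_mutex_consistent (the surrounding function names are mine):

// Illustrative process-shared, robust pthread mutex with owner-death handling.
#include <pthread.h>
#include <errno.h>

void init_robust_mutex(pthread_mutex_t* m)   // run once, on a mutex placed in shm
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
}

int lock_robust(pthread_mutex_t* m)
{
    int rc = pthread_mutex_lock(m);
    if (rc == EOWNERDEAD) {
        // The previous owner died while holding the lock, so the protected
        // state may be inconsistent. Repair it, then mark the mutex usable.
        // recover_shared_state();          // hypothetical application-specific repair
        pthread_mutex_consistent(m);
        rc = 0;
    }
    return rc;   // ENOTRECOVERABLE and other errors are left to the caller
}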
I have an async API which wraps some IO library. The library uses C-style callbacks; the API is C++, so the natural choice (IMHO) was to use std::future/std::promise to build it. Something like std::future<void> Read(uint64_t addr, byte* buff, uint64_t buffSize). However, when I was testing the implementation I saw that the bottleneck is the future/promise, or more precisely, the futex used to implement promise/future. Since the futex, AFAIK, is user space and the fastest mechanism I know of to sync two threads, I switched to using raw futexes, which somewhat improved the situation, but not drastically. The performance floats somewhere around 200k futex WAKEs per second. Then I stumbled upon this article, Futex Scaling for Multi-core Systems, which quite matches the effect I observe with futexes. My question is: since the futex is too slow for me, what is the fastest mechanism on Linux I can use to wake the waiting side? I don't need anything more sophisticated than a binary semaphore, just a signal of IO operation completion. Since IO operations are very fast (tens of microseconds), switching to kernel mode is not an option. Busy waiting is not an option either, since CPU time is precious in my case.
Bottom line: a simple, user-space synchronization primitive, shared between two threads only; only one thread sets the completion, only one thread waits for it.
EDIT001:
What if... Previously I said no spinning or busy waiting. But a futex already spins in a busy wait, right? However, the implementation covers the more general case, which requires a global hash table to hold the futexes, queues for all waiters, etc. Is it a good idea to mimic the same behavior on some simple entity (like an int), with no locks, no global data structures, and busy-wait on it the way a futex already does?
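For reference, a hedged sketch of the raw futex calls the question refers to, wrapped as a one-shot completion flag (the wrapper names are mine; Linux only, and it assumes std::atomic<int> shares its representation with int, which holds on Linux/glibc):

// Illustrative binary-semaphore-style completion flag on a raw futex.
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex_call(void* uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, nullptr, nullptr, 0);
}

struct completion
{
    std::atomic<int> state{0};   // 0 = pending, 1 = done

    void signal()                // called from the IO callback thread
    {
        state.store(1, std::memory_order_release);
        futex_call(&state, FUTEX_WAKE_PRIVATE, 1);     // wake at most one waiter
    }

    void wait()                  // called by the thread awaiting the IO result
    {
        while (state.load(std::memory_order_acquire) == 0)
            futex_call(&state, FUTEX_WAIT_PRIVATE, 0); // sleep while state is still 0
        state.store(0, std::memory_order_relaxed);     // reset for the next operation
    }
};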
In my experience, the bottleneck is due to Linux's poor support for IPC. This probably isn't a multicore scaling issue, unless you have a large number of threads.
When one thread wakes another (by futex or any other mechanism), the system tries to run the 'wakee' thread immediately. But the waker thread is still running and using a core, so the system will usually put the wakee thread on a different core. If that core was previously idle, then the system will have to wake the core up from a power-down state, which takes some time. Any data shared between the threads must now be transferred between the cores.
Then, the waker thread will usually wait for a response from the wakee (it sounds like this is what you are doing). So it immediately goes to sleep, and puts its core to idle.
Then a similar thing happens again when the response comes. The continuous CPU wakes and migrations cause the slowdown. You may well discover that if you launch many instances of your process simultaneously, so that all your cores are busy, you see increased performance as the CPUs no longer have to wake up, and the threads may stop migrating between cores. You can get a similar performance increase if you pin the two threads to one core - it will do more than 1 million 'pings'/sec in this case.
So isn't there a way of saying 'put this thread to sleep and then wake that one'? Then the OS could run the wakee on the same core as the waiter. Well, Google proposed a solution to this with a FUTEX_SWAP API that does exactly this, but it has yet to be accepted into the Linux kernel. The focus now seems to be on user-space thread control via User Managed Concurrency Groups, which will hopefully be able to do something similar. However, at the time of writing this is yet to be merged into the kernel.
Without these changes to the kernel, as far as I can tell there is no way around this problem. 'You are on the fastest route'! UNIX sockets, TCP loopback, pipes all suffer from the same issue. Futexes have the lowest overhead, which is why they go faster than the others. (with TCP you get about 100k pings per sec, about half the speed of a futex impl). Fixing this issue in a general way would benefit a lot of applications/deployments - anything that uses connections to localhost could benefit.
(I did try a DIY approach where the waker thread pins the wakee thread to the same core that the waker is on, but if you don't want to pin the waker, then every time you post the futex you need to pin the wakee to the waker's current core, and the system call to do this has too much overhead)
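A hedged sketch of the pinning experiment mentioned above, using pthread_setaffinity_np to keep a thread on one core (the helper name and core number are placeholders; pthread_setaffinity_np is a GNU extension):

// Illustrative: pin the calling thread to a single core (Linux).
// Compile with _GNU_SOURCE defined if your toolchain doesn't already.
#include <pthread.h>
#include <sched.h>

int pin_self_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // The waker and the wakee would both call this with the same core number.
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}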
I have an application where producers and consumers ("clients") want to send broadcast messages to each other, i.e. an n:m relationship. All of them could be different programs, so they are different processes and not threads.
To reduce the n:m to something more maintainable I was thinking of a setup like introducing a little central server. That server would offer a socket that each client connects to.
And each client would send a new message through that socket to the server - resulting in 1:n.
The server would also offer a shared memory that is read only for the clients. It would be organized as a ring buffer where the new messages would be added by the server and overwrite older ones.
This would give the clients some time to process the message - but if it's too slow it's bad luck, it wouldn't be relevant anymore anyway...
The advantage I see in this approach is that I avoid synchronisation as well as unnecessary data copying and buffer hierarchies; the central one should be enough, shouldn't it?
That's the architecture so far - I hope it makes sense...
Now to the more interesting aspect of implementing that:
The index of the newest element in the ring buffer is a variable in shared memory and the clients would just have to wait till it changes. Instead of a stupid while( central_index == my_last_processed_index ) { /* do nothing */ } I want to free CPU resources, e.g. by using a pthread_cond_wait().
But that needs a mutex that I think I don't need - on the other hand Why do pthreads’ condition variable functions require a mutex? gave me the impression that I'd better ask if my architecture makes sense and could be implemented like that...
Can you give me a hint if all of that makes sense and could work?
(Side note: the client programs could also be written in the common scripting languages like Perl and Python. So the communication with the server has to be recreated there and thus shouldn't be too complicated or even proprietary)
If memory serves, the reason for the mutex accompanying a condition variable is that under POSIX, signalling the condition variable causes the kernel to wake up all waiters on the condition variable. In these circumstances, the first thing that consumer threads need to do is check that there is something to consume, by means of accessing a variable shared between producer and consumer threads. The mutex protects against concurrent access to the variable used for this purpose. This of course means that if there are many consumers, n-1 of them are needlessly awoken.
Having implemented precisely the arrangement described above, the choice of IPC object to use is not obvious. We were buffering audio between high-priority real-time threads in separate processes, and didn't want to block the consumer. As the audio was produced and consumed in real time, we were already getting scheduled regularly on both ends, and if there wasn't anything to consume (or space to produce into), we trashed the data because we'd already missed the deadline.
In the arrangement you describe, you will need a mutex to prevent the consumers concurrently consuming items that are queued (and believe me, on a lightly loaded SMP system, they will). However, you don't need to have the producer contend on this as well.
I don't understand your comment about the consumer having read-only access to the shared memory. In the classic lockless ring buffer implementation, the producer writes the queue tail pointer and the consumer(s) the head, whilst all parties need to be able to read both.
You might of course arrange for the queue head and tails to be in a different shared memory region to the queue data itself.
Also be aware that there is a theoretical data coherency hazard on SMP systems when implementing a ring buffer such as this, namely that write-back to memory of the queue content with respect to the head or tail pointer may occur out of order (they sit in caches, usually per CPU core). There are other variants on this theme to do with synchronization of caches between CPUs. To guard against these, you need memory, load, and store barriers to enforce ordering. See Memory Barrier on Wikipedia. You explicitly avoid this hazard by using kernel synchronisation primitives such as mutexes and condition variables.
The C11 atomic operations can help with this.
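A hedged sketch of that ordering, written with C++ std::atomic (the C11 _Atomic equivalents behave the same way; the struct layout, buffer size and single-producer assumption are mine):

// Illustrative: publish ring-buffer entries with release/acquire ordering.
#include <atomic>
#include <cstdint>

struct ring
{
    static const uint32_t N = 1024;        // power-of-two size, illustrative
    uint64_t slots[N];
    std::atomic<uint32_t> head{0};         // count of published elements

    void publish(uint64_t value)           // single producer (the server)
    {
        uint32_t h = head.load(std::memory_order_relaxed);
        slots[h % N] = value;                            // write the data first...
        head.store(h + 1, std::memory_order_release);    // ...then publish the index
    }

    bool read_newest(uint32_t& last_seen, uint64_t& out) const   // any consumer
    {
        uint32_t h = head.load(std::memory_order_acquire);
        if (h == last_seen) return false;                // nothing new yet
        out = slots[(h - 1) % N];                        // visible thanks to acquire
        last_seen = h;
        return true;
    }
};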
You do need a mutex with pthread_cond_wait() as far as I know. The reason is that the check-then-wait sequence is not atomic: the condition you are waiting on could change between your check and the call, unless both are protected by the mutex.
It's possible that you can ignore this situation - the client might sleep past message 1, but when the subsequent message is sent then the client will wake up and find two messages to process. If that's unacceptable then use a mutex.
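If you do use the mutex, the canonical pattern looks roughly like this (variable names mirror the pseudocode in the question; in the real design the mutex and condition variable would live in shared memory and be initialized with PTHREAD_PROCESS_SHARED attributes, which is not shown here):

// Illustrative consumer-side wait on the ring-buffer index with a condvar.
#include <pthread.h>

static pthread_mutex_t index_mutex   = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  index_changed = PTHREAD_COND_INITIALIZER;
static unsigned central_index = 0;

void wait_for_new_message(unsigned* my_last_processed_index)
{
    pthread_mutex_lock(&index_mutex);
    while (central_index == *my_last_processed_index)   // re-check after every wakeup
        pthread_cond_wait(&index_changed, &index_mutex);
    *my_last_processed_index = central_index;
    pthread_mutex_unlock(&index_mutex);
    // ... process the messages between the old and the new index ...
}

// Server side, after writing a message into the ring buffer:
//   pthread_mutex_lock(&index_mutex);
//   ++central_index;
//   pthread_cond_broadcast(&index_changed);
//   pthread_mutex_unlock(&index_mutex);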
You could probably have a somewhat different design by using sem_t, if your system has it; some POSIX systems are still stuck on the 2001 version of POSIX.
You don't necessarily need a mutex/condition pair; this is just how it was designed a long time ago for POSIX.
Modern C (C11) and C++ (C++11) now bring you (or will bring you) atomic operations, a feature that is implemented in all modern processors but lacked support from most higher-level languages. Atomic operations are part of the answer for resolving the race condition in a ring buffer such as the one you want to implement. But they are not sufficient on their own, because with them you can only do an active wait through polling, which is probably not what you want.
Linux, as an extension to POSIX, has the futex, which resolves both problems: it avoids races for updates by using atomic operations and provides the ability to put waiters to sleep via a system call. Futexes are often considered too low-level for everyday programming, but I think it actually isn't too difficult to use them. I have written up things here.
What are the factors to keep in mind while choosing between Critical Sections, Mutex and Spin Locks? All of them provide for synchronization but are there any specific guidelines on when to use what?
EDIT: I did mean the windows platform as it has a notion of Critical Sections as a synchronization construct.
In Windows parlance, a critical section is a hybrid between a spin lock and a non-busy wait. It spins for a short time, then--if it hasn't yet grabbed the resource--it sets up an event and waits on it. If contention for the resource is low, the spin lock behavior is usually enough.
Critical Sections are a good choice for a multithreaded program that doesn't need to worry about sharing resources with other processes.
A mutex is a good general-purpose lock. A named mutex can be used to control access among multiple processes. But it's usually a little more expensive to take a mutex than a critical section.
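For the cross-process case, a hedged sketch of a Win32 named mutex (the mutex name is a placeholder):

// Illustrative named mutex shared by any process that uses the same name.
#include <windows.h>

int main()
{
    HANDLE h = CreateMutexW(nullptr, FALSE, L"MyAppSharedLock");
    if (!h) return 1;

    DWORD r = WaitForSingleObject(h, INFINITE);
    if (r == WAIT_OBJECT_0 || r == WAIT_ABANDONED) {   // WAIT_ABANDONED: previous owner died
        // ... access the resource shared between processes ...
        ReleaseMutex(h);
    }
    CloseHandle(h);
    return 0;
}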
General points to consider:
The performance cost of using the mechanism.
The complexity introduced by using the mechanism.
In any given situation 1 or 2 may be more important.
E.g.
If you are using multithreading to write a high-performance algorithm that makes use of many cores and need to guard some data for safe access, then 1 is probably very important.
If you have an application where a background thread is used to poll for some information on a timer, and on the rare occasion it notices an update you need to guard some data for access, then 2 is probably more important than 1.
1 will be down to the underlying implementation and probably scales with the scope of the protection e.g. a lock that is internal to a process is normally faster than a lock across all processes on a machine.
2 is easy to misjudge. First attempts to use locks to write thread safe code will normally miss some cases that lead to a deadlock. A simple deadlock would occur for example if thread A was waiting on a lock held by thread B but thread B was waiting on a lock held by thread A. Surprisingly easy to implement by accident.
On any given platform the naming and qualities of locking mechanisms may vary.
On Windows, critical sections are fast and process-specific; mutexes are slower but work across processes. Semaphores cover more complicated use cases. Some problems, e.g. allocation from a pool, may be solved very efficiently using atomic functions rather than locks, e.g. on Windows InterlockedIncrement, which is very fast indeed.
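A hedged sketch of that pool-allocation idea with InterlockedIncrement (the pool layout and names are my own illustration):

// Illustrative lock-free hand-out of slots from a fixed pool (Windows).
#include <windows.h>

#define POOL_SIZE 256
static volatile LONG g_next = -1;       // index of the last slot handed out
static int g_pool[POOL_SIZE];           // the resource being parceled out

// Returns a freshly claimed slot, or nullptr once the pool is exhausted.
int* alloc_from_pool()
{
    LONG idx = InterlockedIncrement(&g_next);   // atomically claim the next index
    if (idx >= POOL_SIZE) return nullptr;
    return &g_pool[idx];
}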
A Mutex in Windows is actually an interprocess concurrency mechanism, making it incredibly slow when used for intraprocess threading. A Critical Section is the Windows analogue to the mutex you normally think of.
Spin Locks are best used when the resource being contested is usually not held for a significant number of cycles, meaning the thread that has the lock is probably going to give it up soon.
EDIT : My answer is only relevant provided you mean 'On Windows', so hopefully that's what you meant.