How to make the Windows slim reader-writer lock fair? - c++

I found out that Windows implements a slim reader-writer lock (see https://msdn.microsoft.com/en-us/library/windows/desktop/aa904937%28v=vs.85%29.aspx ). Unfortunately (for me) this rw-lock is neither FIFO nor fair (in any sense).
Is there some workaround to make the Windows rw-lock fair or FIFO?
If not, in which scenarios would you use the Windows slim rw-lock?

It is unlikely you can change the slim lock itself to be fair, especially since the documentation doesn't indicate any method of doing so, and most locks today are unfair for performance reasons.
That said, it is fairly straightforward to roll your own approximately FIFO lock out of Windows events and a 64-bit control word that you manipulate with compare-and-swap, and it is still very slim. Here's an outline:
The state of the lock is reflected in the control word, which is manipulated atomically to transition between states; this lets threads enter the lock (when permitted) with a single atomic operation and no kernel transition (that's the performance part of "slim"). The events are used to notify waiting threads when they need to block, and can be allocated on demand (that's the low memory footprint of "slim").
The lock control word has the following states (a sketch of one possible bit layout follows the list):
(1) Free - no readers or writers, and no waiters. Any thread can acquire the lock for reading or writing by atomically transitioning the lock into state (2) or (4).
(2) N readers in the lock. There are N readers in the lock at the moment. New readers can immediately acquire the lock by adding 1 to the count - use a field of 30 bits or so within the control word to represent this count. Writers must block (perhaps after spinning). When readers leave the lock, they decrement the count, which may transition to state (1) when the last reader leaves (although they don't need to do anything special in a (2) -> (1) transition).
(3) State (2) + waiting writers + 0 or more waiting readers. In this state, there are 1 or more readers still in the lock, but at least one waiting writer. The writers should wait on an auto-reset event, whose wake order is designed, although not guaranteed, to be FIFO. There is a field in the control word to indicate how many writers are waiting. In this state, new readers that want to enter the lock cannot; instead they set a reader-waiting bit and block on the reader-waiting event. New writers increment the waiting-writer count and block on the writer-waiting event. When the last reader leaves (setting the reader-count field to 0), it signals the writer-waiting event, releasing the longest-waiting writer to enter the lock.
(4) Writer in the lock. When a writer is in the lock, all readers queue up and wait on the reader-waiting event. All incoming writers increment the waiting-writer count and queue up as usual on the writer-waiting event. There may even be some waiting readers when the writer acquires the lock because of state (3) above, and these are treated identically. When the writer leaves the lock, it checks for waiting writers and readers and either unblocks a writer or all readers, depending on policy, discussed below.
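As a concrete illustration, one possible way to pack these states into the 64-bit control word is sketched below; the field widths and constant names are illustrative assumptions, not part of any fixed design.

    #include <cstdint>

    // Hypothetical layout of the 64-bit control word (one of many possible choices):
    //   bits  0..29 : reader count          (state 2: N readers in the lock)
    //   bits 30..59 : waiting-writer count  (state 3: readers + waiting writers)
    //   bit  60     : reader-waiting flag   (readers blocked behind writers)
    //   bit  61     : writer-active flag    (state 4: a writer owns the lock)
    //   value 0     : state 1, the lock is free
    constexpr uint64_t kReaderCountMask   = (1ull << 30) - 1;
    constexpr uint64_t kWaitingWriterUnit = 1ull << 30;
    constexpr uint64_t kWaitingWriterMask = ((1ull << 30) - 1) << 30;
    constexpr uint64_t kReaderWaitingBit  = 1ull << 60;
    constexpr uint64_t kWriterActiveBit   = 1ull << 61;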
All the state transitions discussed above are done atomically using compare-and-swap. The typical pattern is that any lock() or unlock() call looks at the control word, determines what state the lock is in and what transition needs to happen (following the rules above), calculates the new control word in a temporary, then attempts to swap it in with compare-and-swap. Sometimes that attempt fails because another thread concurrently modified the control word (e.g., another reader entered the lock, incrementing the reader count). No problem, just start over from "determine state..." and try again. Such races are rare in practice since the state word calculation is very short, and that's just how things work with CAS-based complex locks.
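To make that pattern concrete, here is a rough sketch of the reader-entry fast path on Windows, using the hypothetical layout above and InterlockedCompareExchange64; spinning, the blocking slow path and overflow checks are omitted.

    #include <windows.h>
    #include <cstdint>

    // Constants repeated from the layout sketch above.
    constexpr uint64_t kWaitingWriterMask = ((1ull << 30) - 1) << 30;
    constexpr uint64_t kWriterActiveBit   = 1ull << 61;

    // Reader-acquire fast path; 'control' is the lock's shared control word.
    // Returns false when the caller must take the slow path (set the
    // reader-waiting bit and block on the reader-waiting event).
    bool TryAcquireRead(volatile LONG64& control)
    {
        for (;;) {
            LONG64 old = control;                  // observe the current state
            if (old & (kWriterActiveBit | kWaitingWriterMask))
                return false;                      // a writer is active or waiting
            LONG64 desired = old + 1;              // one more reader in the low 30-bit count
            if (InterlockedCompareExchange64(&control, desired, old) == old)
                return true;                       // the transition took effect
            // Another thread changed the word concurrently: re-read and retry.
        }
    }

The writer-entry and exit paths follow the same read / compute / compare-and-swap shape, only with different state checks.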
This lock design is "slim" in almost every sense. Performance-wise, it is near the top of what you can get for a general-purpose design. In particular, the common fast paths of (a) a reader entering the lock with 0 or more readers already in the lock, (b) a reader leaving the lock with 0 or more readers still in the lock and (c) a writer entering/leaving an uncontended lock are all about as fast as possible in the usual case: a single atomic operation. Furthermore, the reader entry and exit paths are "lock-free" in the sense that incoming readers do not temporarily take a mutex internal to the rwlock, manipulate state, and then unlock it while entering/leaving the lock. That internal-mutex approach is slow and subject to issues whenever a reader thread is context-switched out at the critical moment while it holds the internal lock. Such approaches do not scale to heavy reader activity with a short rwlock critical section: even though multiple readers can, in theory, enter the critical section, they all bottleneck on entering and leaving the internal lock (which happens twice for every enter/exit operation) and performance is worse than a normal mutex!
It is also lightweight in that it only needs a couple of Windows Event objects, and these objects can be allocated on demand - they are only needed when contention occurs and a state transition that requires blocking is about to occur. That's how CRITICAL_SECTION objects work.
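Allocating an event on demand can use the familiar create-then-publish-with-CAS idiom; the sketch below is illustrative (the function name and the 'slot' the lock stores are my own assumptions).

    #include <windows.h>

    // Lazily create one of the lock's events, racing safely with other threads.
    // 'manualReset' would be TRUE for the event that wakes all waiting readers and
    // FALSE (auto-reset) for the event that wakes a single waiting writer.
    HANDLE GetOrCreateEvent(HANDLE volatile& slot, BOOL manualReset)
    {
        HANDLE h = slot;
        if (h != nullptr)
            return h;                                          // already created
        h = CreateEventW(nullptr, manualReset, FALSE, nullptr);
        HANDLE prev = (HANDLE)InterlockedCompareExchangePointer(&slot, h, nullptr);
        if (prev != nullptr) {                                 // lost the race:
            CloseHandle(h);                                    // use the winner's event
            return prev;
        }
        return h;
    }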
The lock above is fair in the sense that readers won't starve writers, and writers are served in FIFO order. How writers interact with waiting readers is up to your policy for whom to unblock when the lock becomes free after a writer unlocks and there are both waiting readers and writers. One simple policy is to unblock all waiting readers.
In this design, writers will alternate in FIFO order with FIFO batches of readers. Writers are FIFO relative to other writers, and reader batches are FIFO relative to other reader batches, but the relationship between writers and readers isn't exactly FIFO: because all incoming readers are added to the same reader-waiting set, when there are already several waiting writers, arriving readers all go into the next "batch" to be released, which actually puts them ahead of writers that are already waiting. That's quite reasonable though: readers all go at once, so adding more readers to the batch doesn't necessarily cost much and probably increases efficiency, and if you did serve every thread in strict FIFO order, the lock would reduce in behavior to a simple mutex under contention.
Another possible design is to always unblock writers if any are waiting. This favors writers at the expense of readers and does mean that a never-ending stream of writers could block out readers indefinitely. This approach makes sense where you know your writes are latency-sensitive and you either aren't worried about reader starvation, or you know it can't occur due to the design of your application (e.g., because there is only one possible writer at a time).
Beyond that, there are a lot of other policies possible, such as favoring writers up until readers have been waiting for a certain period, or limiting reader batch sizes, or whatever. They are mostly possible to implement efficiently since the bookkeeping is generally limited to the slow paths where threads will block anyway.
I've glossed over some implementation details and gotchas here (in particular, the need to be careful when making the transitions that involve blocking to avoid "missed wakeup" issues) - but this definitely works. I've written such a lock before the slim rwlock existed to fill the need for a fast high-performance rwlock, and it performs very well. Other tradeoffs are possible too, e.g., for designs in which reads are expected to dominate, contention can be reduced by splitting up the control word across cache lines, at the cost of more expensive write operations.
One final note - this lock is a bit fatter, in memory use, than the Windows one in the case that it is contended, because it allocates one or two Windows Events per lock, while the slim lock avoids this. The slim lock likely manages that by supporting the behavior directly in the kernel, so the control word itself can be used as part of the kernel-side wait list. You can't reproduce that exactly, but you can still remove the per-lock overhead in another way: use thread-local storage to allocate your two events per thread rather than per lock. Since a thread can only be waiting on one lock at a time, you only need this structure once per thread. That brings it into line with the slim lock in memory use (unless you have very few locks and a ton of threads).
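A tiny sketch of that thread-local idea follows; the lock would then keep a short per-lock list of the handles of currently blocked threads rather than its own events, and that bookkeeping is omitted here.

    #include <windows.h>

    // One pair of events per thread, created on first use and reused for every
    // lock this thread ever blocks on; a thread blocks on at most one lock at a time.
    struct ThreadWaitEvents {
        HANDLE readerWait = CreateEventW(nullptr, TRUE, FALSE, nullptr);   // manual-reset
        HANDLE writerWait = CreateEventW(nullptr, FALSE, FALSE, nullptr);  // auto-reset
    };

    ThreadWaitEvents& GetThreadWaitEvents()
    {
        thread_local ThreadWaitEvents events;
        return events;
    }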

this rw-lock is neither fifo nor is it fair
I wouldn't expect anything to do with threading to be "fair" or "fifo" unless it said so explicitly. In this case, I would expect write locks to take priority, since a writer might never be able to obtain the lock if there are a lot of reading threads; but then I also wouldn't assume that's the case, and I've not read the MS documentation for a better understanding.
That said, it sounds like your issue is that you have a lot of contention on the lock, caused by writer threads; because otherwise your readers would always be able to share the lock. You should probably examine why your writing thread(s) hold the lock for so long; buffering up additions, for example, will help mitigate this.

Related

C++ prioritize data synchronization between threads

I have a scenario, where I have a shared data model between several threads. Some threads are going to write to that data model cyclically and other threads are reading from that data model cyclically. But it is guaranteed that writer threads are only writing and reader threads are only reading.
Now the scenario is that reading data shall have higher priority than writing data due to real-time constraints on the reader side. So it is not acceptable that e.g. a writer locks the data for too long. But a lock with a guaranteed maximum locking time would be acceptable (e.g. it would be acceptable for the reader to wait at most 1 ms until the data is synchronized and available).
So I'm wondering how this is achievable, because the "traditional" locking mechanisms (e.g. std::lock) wouldn't give those real-time guarantees.
Normally in such a scenario you use a reader-writer-lock. This allows either a read by all readers in parallel or a write by a single writer.
But that does nothing to stop a writer from holding the lock for minutes if it so desires. Forcing the writer out of the lock is probably also not a good idea: the object is probably in some inconsistent state mid-change.
There is another synchronization method called read-copy-update (RCU) that might help. This allows writers to modify an element without being blocked by readers. The drawback is that for some time you might get some readers still reading the old data while others read the new data.
It also might be problematic with multiple writers if they try to change the same member. The slower writer might have computed all the needed updates only to notice that some other thread changed the object. It then has to start over, wasting all the time it already spent.
Note: copying the element can be done in constant time, certainly under 1ms. So you can guarantee readers are never blocked for long. By releasing the write lock first you guarantee readers to read between any 2 writes, assuming the RW lock is designed with the same principle.
So I would suggest another solution I call write-intent-locking:
You start with a RW lock but add a lock to handle write intent. Any writer can acquire the write-intent lock at any time, but only one of them can hold it; it's exclusive. Once a writer holds the write-intent lock, it copies the element and starts modifying the copy. It can take as long as it wants to do that, since it isn't blocking any readers. It does block other writers, though.
When all the modifications are done the writer acquires the write lock and then quickly copies, moves or replaces the element with the prepared copy. It then releases the write and write-intent lock, unblocking both the readers and writers that want to access the same element.
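A rough sketch of this write-intent pattern with standard C++ primitives (std::shared_mutex needs C++17; the class and member names here are illustrative assumptions):

    #include <mutex>
    #include <shared_mutex>

    template <typename T>
    class WriteIntentLocked {
    public:
        // Readers: shared access; never blocked by a writer preparing its copy.
        template <typename F>
        auto read(F&& f) const {
            std::shared_lock lock(rw_);
            return f(value_);
        }

        // Writers: take the write-intent lock first, prepare a copy at leisure,
        // then hold the exclusive lock only for the cheap replace.
        template <typename F>
        void write(F&& f) {
            std::lock_guard intent(writeIntent_);  // excludes other writers only
            T copy;
            {
                std::shared_lock lock(rw_);
                copy = value_;                     // snapshot the current value
            }
            f(copy);                               // slow modification; readers unaffected
            {
                std::unique_lock lock(rw_);
                value_ = std::move(copy);          // brief exclusive section
            }
        }

    private:
        mutable std::shared_mutex rw_;
        std::mutex writeIntent_;
        T value_{};
    };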
The way I would approach this is to have two identical copies of the dataset; call them copy A and copy B.
Readers always read from copy B, being careful to lock a reader/writer lock in read-only mode before accessing it.
When a writer-thread wants to update the dataset, it locks copy A (using a regular mutex) and updates it. The writer-thread can take as long as it likes to do this, because no readers are using copy A.
When the writer-thread is done updating copy A, it locks the reader/writer lock (in exclusive/writer-lock mode) and swaps dataset A with dataset B. (This swap should be done by exchanging pointers, and is therefore O(1) fast).
The writer-thread then unlocks the reader/writer-lock (so that any waiting reader-threads can now access the updated data-set), and then updates the other data-set the same way it updated the first data-set. This can also take as long as the writer-thread likes, since no reader-threads are waiting on this dataset anymore.
Finally the writer-thread unlocks the regular mutex, and we're done.
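A sketch of the two-copy scheme described above (names are illustrative; the swap under the exclusive lock is just a pointer exchange, so readers are blocked only for an instant):

    #include <memory>
    #include <mutex>
    #include <shared_mutex>

    template <typename Dataset>
    class DoubleBuffered {
    public:
        // Readers always use the "front" copy under a shared lock.
        template <typename F>
        auto read(F&& f) const {
            std::shared_lock lock(rwLock_);
            return f(*front_);
        }

        // Writers update the "back" copy at leisure, then swap pointers briefly.
        template <typename F>
        void write(F&& f) {
            std::lock_guard writerLock(writerMutex_);  // one writer at a time
            f(*back_);                                 // slow update; no readers here
            {
                std::unique_lock lock(rwLock_);
                std::swap(front_, back_);              // O(1) pointer swap
            }
            f(*back_);   // bring the other copy up to date; readers unaffected
        }

    private:
        mutable std::shared_mutex rwLock_;
        std::mutex writerMutex_;
        std::unique_ptr<Dataset> front_ = std::make_unique<Dataset>();
        std::unique_ptr<Dataset> back_  = std::make_unique<Dataset>();
    };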
Well, you've got readers, and you've got writers, and you need a lock, so.... how about a readers/writer lock?
The reason I mention that up-front is because (a) you might not be aware of it, but more importantly (b) there's no standard RW lock in C++ (EDIT: my mistake, one was added in C++14), so your thinking about this is perhaps being done in the context of std::mutex. Once you've decided to go with a RW lock, you can benefit from other people's thinking about those locks.
In particular, there's a number of different options for prioritizing threads contending over RW locks. With one option, a thread acquiring a write lock waits until all current reader threads drop the lock, but readers who start waiting after the writer don't get the lock until the writer's done with it.
With that strategy, as long as the writer thread releases and reacquires the lock after each transaction, and as long as the writer completes each transaction within your 1 ms target, readers don't starve.
And if your writer can't promise that, then there is zero alternative but to redesign the writer: either doing more processing before acquiring the lock, or splitting a transaction into multiple pieces where it's safe to drop the lock between each.
If, on the other hand, your writer's transactions take much less than 1 ms, then you might consider skipping the release/reacquire between each one if less than 1 ms has elapsed (purely to reduce the processing overhead of doing so).... but I wouldn't advise it. Adding complexity and special cases and (shudder) wall clock time to your implementation is rarely the most practical way to maximize performance, and rapidly increases the risk of bugs. A simple multithreading system is a reliable multithreading system.
If the model allows writing to be interrupted, then it also allows buffering. Use a FIFO queue and start reading only when there are 50 elements written already. Use (smart) pointers to swap data in the FIFO queue. Swapping 8 bytes of pointer takes nanoseconds. Since there is buffering, writing will happen on a different element than the one readers are working with, so there won't be lock contention as long as the producer can keep pace with the consumers.
Why doesn't the reader produce the data it consumes itself? If you can have n producers and n consumers, each consumer could produce its own data too, without any producer. But this will have different multithreaded scaling. Maybe your algorithm is not applicable here, but if it is, it would be more like independent multi-processing instead of multi-threading.
Can the writer's work be converted into multiple smaller jobs? Progress within the writer can be reported through an atomic counter. When a reader has a waiting budget, it checks the atomic value and, if progress looks slow, it can use the same atomic value to push it straight to 100%; the writer sees this and quits the lock early.

What does a read lock do in C++?

When using a shared_mutex, there is exclusive access and shared access. Exclusive access only allows one thread to access the resource while others are blocked until the thread holding the lock releases the lock. A shared access is when multiple threads are allowed to access the resource but under a "read lock". What is a "read lock"? I don't understand the meaning of a read lock. Can someone give code examples of a "read lock".
My guess of a read lock: I thought that a read lock only allows threads to run the code and not modify anything. Apparently I was wrong when I tried some code and it didn't work as I thought.
Again, can someone help me understand what a read lock is.
Thanks.
Your guess is very close to being right but there's a slight subtlety.
The read/write distinction shows intent of the locker, it does not limit what the locker actually does. If a read-locker decides to modify the protected resource, it can, and hilarity will ensue. However, that's no different to a thread modifying the protected resource with no lock at all. The programmer must follow the rules, or pay the price.
As to the distinction, a read lock is for when you only want to read the shared resource. A billion different threads can do this safely because the resource is not changing while they do it.
A write lock, on the other hand, means you want to change the shared resource, so there should be no readers (or other writers) with an active lock.
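Since the question asks for code: with std::shared_mutex (C++17), a "read lock" is simply shared ownership taken via std::shared_lock, and a "write lock" is exclusive ownership via std::unique_lock or std::lock_guard. A minimal sketch (the data and function names are illustrative):

    #include <shared_mutex>
    #include <string>

    std::shared_mutex mtx;
    std::string sharedData;

    std::string readData()                  // any number of threads at once
    {
        std::shared_lock lock(mtx);         // "read lock": shared ownership
        return sharedData;                  // read only; modifying here would be a bug
    }

    void writeData(const std::string& s)    // one thread at a time, no readers
    {
        std::unique_lock lock(mtx);         // "write lock": exclusive ownership
        sharedData = s;
    }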
It's a way of optimising locks based on usage. For example, consider the following lock queue, with the earliest entries on the right, and no queue jumping (readers jumping in front of writers if the writer is currently blocked) allowed:
(queue here) -> RR W RRRR -> (action here)
The "resource controller" can allow all four R lockers at the queue head access at the same time.
The W locker will then have to wait until all of those locks are released before gaining access.
The final two R lockers must wait until the W locker releases but can then both get access concurrently.
In other words, the possible lock holders at any given time are (mutually exclusive):
any number of readers.
one writer.
As an aside, there are other possible optimisation strategies you can use as well.
For example (as alluded to earlier), you may want a reader to be able to jump the queue if it looks like the writer would have to wait a while anyway (such as there being a large number of current readers).
That's fraught with danger since you could starve a write locker, but it can be done provided there are limits on the number of queue jumpers.
I have done it in the past. From memory (a long ago memory of an embedded OS built in BCPL for the 6809 CPU, which shows more of my age than I probably want to), the writer included with its lock request the maximum number of queue jumpers it was happy with (defaulting to zero), and the queueing of future readers took that into account.
A write-lock only allows one lock to be acquired at a time.
A read-lock only allows other read-locks to be acquired at the same time, but prevents write-locks.
If you try to acquire a write-lock, you will be forced to wait until every other lock (read or write) has been released. Once you have the write-lock, no one else can get a lock of either kind.
If you try to acquire a read-lock, you will have to wait for any write-lock to be released. If there is no write-lock, you will obtain the read-lock, regardless of how many other threads have a read-lock.
Whether or not you modify data is entirely up to you to police. If you have a read-lock, you should not be modifying shared data.
A read lock is just another way of saying that a thread of execution has acquired shared ownership of the mutex. It says nothing about what that thread can actually do.
It is up to the programmer not to update whatever shared resource is being protected by the mutex under a shared lock. To do that, acquire an exclusive lock.

Robust rwlock in posix

Posix provides a mechanism for a mutex to be marked as "robust", allowing multi-processes systems to recover gracefully from the crash of a process holding a mutex.
pthread_mutexattr_setrobust(&mutexattr, PTHREAD_MUTEX_ROBUST);
http://man7.org/linux/man-pages/man3/pthread_mutexattr_setrobust.3.html
However, there doesn't seem to be an equivalent for rwlock (reader-writer locks).
How can a process gracefully recover from a process crashing while holding a rwlock?
Implementing a robust rwlock is actually quite difficult due to the "concurrent readers" property - a rwlock with bounded storage but an unbounded number of concurrent readers fundamentally cannot track who its readers are, so if knowledge of who the current readers are is to be kept (in order to decrement the current read lock count when a reader dies), it must be the reader tasks themselves, not the rwlock, which are aware of their ownership of it. I don't see any obvious way it can be built on top of robust mutexes, or on top of the underlying mechanisms (like robust_list on Linux) typically used to implement robust mutexes.
If you really need robust rwlock semantics, you're probably better off having some sort of protocol with a dedicated coordinator process that's assumed not to die, that tracks death of clients via closure of a pipe/socket to them and is able to tell via shared memory contents whether the process that died held a read lock. Note that this still involves implementing your own sort of rwlock.
How can a process gracefully recover from a process crashing while holding a rwlock?
POSIX does not define a robustness option for its rwlocks. If a process dies while holding one locked, the lock is not recoverable -- you cannot even pthread_rwlock_destroy it. You also cannot expect any processes blocked trying to acquire the lock to unblock until their timeout, if any, expires. Threads blocked trying unconditionally and without a timeout to acquire the lock cannot be unblocked even by delivering a signal, because POSIX specifies that if their wait is interrupted by a signal, they will resume blocking after the signal handler finishes.
Therefore, in the case that one of several cooperating processes dies while holding a shared rwlock locked, whether for reading or writing, the best possible result is probably to arrange for the other cooperating processes to shut down as cleanly as possible. You should be able to arrange for something like that via a separate monitoring process that sends SIGTERMs to the others when a locking failure occurs. There would need to be much more to it than that, though, probably including some kind of extra discipline or wrapping around acquiring the rwlock.
Honestly, if you want robustness then you're probably better off rolling your own read/write lock using, for example, POSIX robust mutexes and POSIX condition variables, as #R.. described in comments.
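For reference, the recovery side of a robust mutex, which any such hand-rolled lock would build on, looks roughly like this (a sketch; the mutex is assumed to have been initialized elsewhere with PTHREAD_MUTEX_ROBUST and, for cross-process use, PTHREAD_PROCESS_SHARED):

    #include <errno.h>
    #include <pthread.h>

    // Assumed to be initialized elsewhere with the robust attribute set.
    extern pthread_mutex_t lock_mutex;

    int lock_robust(void)
    {
        int rc = pthread_mutex_lock(&lock_mutex);
        if (rc == EOWNERDEAD) {
            // The previous owner died while holding the mutex. We now own it, but
            // the protected state (e.g. the rwlock's reader/writer bookkeeping)
            // may be inconsistent: repair it, then mark the mutex consistent.
            /* ... repair shared state here ... */
            pthread_mutex_consistent(&lock_mutex);
            rc = 0;
        }
        return rc;  // 0 on success; other errors (e.g. ENOTRECOVERABLE) left to the caller
    }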
Consider also that robustness of the lock itself is only half the picture. If a thread dies while holding the write lock, then you have the additional issue of ensuring the integrity of the data protected by the lock before continuing.

What is a scalable lock?

What is a scalable lock? And how is it different from a non-scalable lock? I first saw this term in the context of the TBB rw-lock, and couldn't decide which to use.
In addition, is there any rw-lock that prioritizes readers over writers?
There is no formal definition of the term "scalable lock" or "non-scalable lock". What it's meant to imply is that some locking algorithms, techniques or implementations perform reasonably well even when there is a lot of contention for the lock, and some do not.
Sometimes the problem is algorithmic. A naive implementation of priority inheritance, for example, may require O(n) work to release a lock, where n is the number of waiting threads. That implies O(n^2) work for every waiting thread to be serviced.
Sometimes the problem is to do with hardware. Simple spinlocks (e.g. implementations for which the lock cache line is shared and acquirers don't back off) don't scale on SMP hardware with a single bus interconnect, because writing to a cache line requires that the CPU acquires the cache line, and the CPU interconnect is a single point of contention. If there are n CPUs trying to acquire the same lock at the same time, you may end up with O(n) bus traffic to acquire the lock. Again, this means O(n^2) time for all n CPUs to be satisfied.
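To make the hardware point concrete, here is the textbook contrast between a naive test-and-set spinlock and a test-and-test-and-set variant (a sketch; real implementations would also add backoff and perhaps queueing, as in MCS locks):

    #include <atomic>

    std::atomic<bool> locked{false};

    // Naive spinlock: every spin iteration is an atomic read-modify-write, so all
    // waiting CPUs keep pulling the lock's cache line back and forth (poor scaling).
    void lock_naive() {
        while (locked.exchange(true, std::memory_order_acquire)) {
            // spin
        }
    }

    // Test-and-test-and-set: spin on a plain load (the cache line stays shared)
    // and only attempt the expensive exchange when the lock appears free.
    void lock_ttas() {
        for (;;) {
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
            while (locked.load(std::memory_order_relaxed)) {
                // spin locally; adding exponential backoff helps further
            }
        }
    }

    void unlock() { locked.store(false, std::memory_order_release); }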
In general, you should avoid non-scalable locks unless two conditions are met:
Contention is light.
The critical section is very short.
You really have to know that the two conditions are met. A critical section may be short in terms of lines of code, but not be short in wall time. If in doubt, use a scalable lock, and later fix any which have been measured to cause performance problems.
As for your last question, I'm not aware of an off-the-shelf read-write lock which favours readers. Actually, most APIs don't specify policy, including pthreads (annoyingly).
My first comment is that you probably don't want it. If you have high contention, favouring one over the other kills throughput, and if you don't have high contention, it won't make a difference. About the only reason I can think of not to use a rw lock with a completely fair policy is if you have thread priorities which must be respected, so you want the highest-priority thread to be preferred.
But if you must, you could always roll your own. All you need is a couple of flags (one for "readers can go now" and one for "a writer can go now"), condition variables protecting the flags, a single mutex protecting the condition variables, and a couple of counters indicating how many readers and writers are waiting. That should be all you need; implementing this should be quite instructive.
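A sketch of that roll-your-own lock, here with a policy that favours readers as asked (illustrative names; a production version would add timeouts and some guard against writer starvation):

    #include <condition_variable>
    #include <mutex>

    class ReaderPreferringRWLock {
    public:
        void lock_shared() {
            std::unique_lock lk(m_);
            ++waitingReaders_;
            readersGo_.wait(lk, [&] { return !writerActive_; });  // readers only wait for writers
            --waitingReaders_;
            ++activeReaders_;
        }
        void unlock_shared() {
            std::unique_lock lk(m_);
            if (--activeReaders_ == 0)
                writersGo_.notify_one();           // maybe a writer can go now
        }
        void lock() {                              // write lock
            std::unique_lock lk(m_);
            ++waitingWriters_;
            writersGo_.wait(lk, [&] {
                // reader preference: writers wait out active readers, other
                // writers, and any readers already waiting to get in
                return !writerActive_ && activeReaders_ == 0 && waitingReaders_ == 0;
            });
            --waitingWriters_;
            writerActive_ = true;
        }
        void unlock() {
            std::unique_lock lk(m_);
            writerActive_ = false;
            readersGo_.notify_all();               // readers first, per the policy
            writersGo_.notify_one();
        }

    private:
        std::mutex m_;
        std::condition_variable readersGo_, writersGo_;
        int activeReaders_ = 0, waitingReaders_ = 0, waitingWriters_ = 0;
        bool writerActive_ = false;
    };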

Thread read versus write locking

I'm writing a CPU intensive program in C++ that has several threads needing to access a shared data structure, so locking will be required. To maximize throughput, I want to keep the bottleneck to a minimum. It looks like maybe nine times out of ten it will only be necessary to read the data structure, and one time out of ten it will be necessary to modify it.
Is there a way to have threads take read or write locks, so that write locks block everything but read locks don't block each other?
A portable solution would be ideal, but if there is one solution for Windows and another for Linux that would be okay.
Yes, this is a common situation that can be solved with a reader-writer lock.
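In standard C++, std::shared_mutex (C++17; std::shared_timed_mutex in C++14, or boost::shared_mutex / pthread_rwlock_t / Windows SRW locks if you can't use those) gives you exactly this read/write distinction. A sketch with a read-mostly map (names illustrative):

    #include <map>
    #include <shared_mutex>
    #include <string>

    std::shared_mutex tableLock;
    std::map<std::string, int> table;

    int lookup(const std::string& key)             // the common case (~9 of 10)
    {
        std::shared_lock lock(tableLock);          // readers don't block each other
        auto it = table.find(key);
        return it != table.end() ? it->second : -1;
    }

    void update(const std::string& key, int value) // the rare case
    {
        std::unique_lock lock(tableLock);          // blocks readers and other writers
        table[key] = value;
    }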
Note that depending on the dynamic properties of your program, you may need to be careful about writer starvation. If there are enough readers that their attempts to read always overlap (or overlap for a long time), then a simple implementation of a reader-writer lock will "starve" the writer by making the writer wait until there are no readers reading. In a more advanced implementation, a writer request will be conceptually inserted into the queue before subsequent readers, allowing the writer to have a chance to access after all the previously active readers finish.
Most implementations require you to know ahead of time whether you want a read lock or a write lock. Some implementations allow you to "upgrade" a read lock into a write lock without having to release the read lock first (which would give another writer an opportunity to enter the lock).