C++ prioritize data synchronization between threads - c++

I have a scenario, where I have a shared data model between several threads. Some threads are going to write to that data model cyclically and other threads are reading from that data model cyclically. But it is guaranteed that writer threads are only writing and reader threads are only reading.
Now the scenario is that reading data shall have higher priority than writing data, due to real-time constraints on the reader side. So it is not acceptable that, e.g., a writer locks the data for too long. But a lock with a guaranteed maximum locking time would be acceptable (e.g. it would be acceptable for the reader to wait at most 1 ms until the data is synchronized and available).
So I'm wondering how this is achievable, because the "traditional" locking mechanisms (e.g. std::lock) wouldn't give those real time guarantees.

Normally in such a scenario you use a reader-writer-lock. This allows either a read by all readers in parallel or a write by a single writer.
But that does nothing to stop a writer from holding the lock for minutes if it so desires. Forcing the writer out of the lock is probably also not a good idea: the object is probably in some inconsistent, half-modified state.
There is another synchronization method called read-copy-update that might help. This allows writers to modify an element without being blocked by readers. The drawback is that you might get some readers still reading the old data and others reading the new data for some time.
It also might be problematic with multiple writers if they try to change the same member. The slower writer might have computed all the needed updates only to notice that some other thread has changed the object. It then has to start over, wasting all the time it already spent.
Note: copying the element can be done in constant time, certainly under 1ms. So you can guarantee readers are never blocked for long. By releasing the write lock first you guarantee readers to read between any 2 writes, assuming the RW lock is designed with the same principle.
So I would suggest another solution I call write-intent-locking:
You start with a RW lock but add a lock to handle write-intent. Any writer can try to acquire the write-intent lock at any time, but only one of them can hold it; it's exclusive. Once a writer holds the write-intent lock it copies the element and starts modifying the copy. It can take as long as it wants to do that as it's not blocking any readers. It does block other writers though.
When all the modifications are done the writer acquires the write lock and then quickly copies, moves or replaces the element with the prepared copy. It then releases the write and write-intent lock, unblocking both the readers and writers that want to access the same element.
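The write-intent steps above can be sketched roughly as follows. This is only a minimal illustration, assuming C++17 (for std::shared_mutex) and a copyable element type; the class name WriteIntentGuarded and its member names are made up for this example and are not part of any library.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <utility>
#include <vector>

// Hypothetical wrapper illustrating write-intent locking (not a library type).
template <typename T>
class WriteIntentGuarded {
public:
    explicit WriteIntentGuarded(T initial) : value_(std::move(initial)) {}

    // Readers hold the shared lock only while copying the element out.
    T read() const {
        std::shared_lock<std::shared_mutex> lock(rw_);
        return value_;
    }

    // Writers prepare their changes on a private copy while holding only the
    // write-intent mutex, then take the exclusive lock just for the swap.
    template <typename F>
    void update(F&& modify) {
        std::lock_guard<std::mutex> intent(intent_);  // one writer at a time
        T copy = read();                              // snapshot of the element
        modify(copy);                                 // may be slow; readers are not blocked
        std::unique_lock<std::shared_mutex> lock(rw_);
        using std::swap;
        swap(value_, copy);                           // short exclusive section
        // Locals are destroyed in reverse order: the exclusive lock is released
        // first, then the old value (now in 'copy') is freed, then the intent lock.
    }

private:
    mutable std::shared_mutex rw_;
    std::mutex intent_;
    T value_;
};
```

How long readers can be blocked then depends only on the cost of the swap and on how the RW lock implementation schedules waiting readers and writers, which the standard does not specify.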

The way I would approach this is to have two identical copies of the dataset; call them copy A and copy B.
Readers always read from copy B, being careful to lock a reader/writer lock in read-only mode before accessing it.
When a writer-thread wants to update the dataset, it locks copy A (using a regular mutex) and updates it. The writer-thread can take as long as it likes to do this, because no readers are using copy A.
When the writer-thread is done updating copy A, it locks the reader/writer lock (in exclusive/writer-lock mode) and swaps dataset A with dataset B. (This swap should be done by exchanging pointers, and is therefore O(1) fast).
The writer-thread then unlocks the reader/writer-lock (so that any waiting reader-threads can now access the updated data-set), and then updates the other data-set the same way it updated the first data-set. This can also take as long as the writer-thread likes, since no reader-threads are waiting on this dataset anymore.
Finally the writer-thread unlocks the regular mutex, and we're done.
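A rough sketch of this two-copy scheme, assuming C++17 and an update that can be expressed as a callable applied to each copy in turn. The class name DoubleBuffered and its members are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <utility>

// Hypothetical illustration of the "copy A / copy B" approach.
template <typename T>
class DoubleBuffered {
public:
    explicit DoubleBuffered(const T& initial)
        : back_(std::make_unique<T>(initial)),    // copy A: writers work here
          front_(std::make_unique<T>(initial)) {} // copy B: readers look here

    // Readers only ever touch the front copy, under the shared lock.
    template <typename F>
    auto read(F&& f) const {
        std::shared_lock<std::shared_mutex> lock(rw_);
        return f(static_cast<const T&>(*front_));
    }

    // 'apply' runs twice (once per copy), so it must give the same result
    // both times.
    template <typename F>
    void write(F&& apply) {
        std::lock_guard<std::mutex> writer(writer_mutex_);  // regular mutex
        apply(*back_);               // slow update; no reader uses this copy
        {
            std::unique_lock<std::shared_mutex> lock(rw_);
            back_.swap(front_);      // O(1) pointer exchange
        }
        apply(*back_);               // bring the former front copy up to date
    }

private:
    mutable std::shared_mutex rw_;
    std::mutex writer_mutex_;
    std::unique_ptr<T> back_;
    std::unique_ptr<T> front_;
};
```

The price of this design is double the memory and applying every update twice; in exchange, the exclusive section contains nothing but a pointer swap.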

Well, you've got readers, and you've got writers, and you need a lock, so.... how about a readers/writer lock?
The reason I mention that up-front is because (a) you might not be aware of it, but more importantly (b) there's no standard RW lock in C++ (EDIT: my mistake, std::shared_timed_mutex was added in C++14 and std::shared_mutex in C++17), so your thinking about this is perhaps being done in the context of std::mutex. Once you've decided to go with a RW lock, you can benefit from other people's thinking about those locks.
In particular, there's a number of different options for prioritizing threads contending over RW locks. With one option, a thread acquiring a write lock waits until all current reader threads drop the lock, but readers who start waiting after the writer don't get the lock until the writer's done with it.
With that strategy, as long as the writer thread releases and reacquires the lock after each transaction, and as long as the writer completes each transaction within your 1 ms target, readers don't starve.
And if your writer can't promise that, then there is no alternative but to redesign the writer: either doing more processing before acquiring the lock, or splitting a transaction into multiple pieces where it's safe to drop the lock between each.
If, on the other hand, your writer's transactions take much less than 1 ms, then you might consider skipping the release/reacquire between each one if less than 1 ms has elapsed (purely to reduce the processing overhead of doing so).... but I wouldn't advise it. Adding complexity and special cases and (shudder) wall clock time to your implementation is rarely the most practical way to maximize performance, and rapidly increases the risk of bugs. A simple multithreading system is a reliable multithreading system.

If the model allows writing to be interrupted, then it also allows buffering. Use a FIFO queue and start reading only when there are 50 elements written already. Use (smart) pointers to swap data in the FIFO queue. Swapping 8 bytes of pointer takes nanoseconds. Since there is buffering, writing will be on a different element than the readers are working with, so there won't be lock contention as long as the producer can keep pace with the consumers.
Why doesn't the reader produce its own consumer data? If you can have n producers and n consumers, each consumer can produce its own data too, without any producer. But this will have different multithread scaling. Maybe your algorithm is not applicable here but if it is, it would be more like independent multi-processing instead of multi-threading.
Can the writer's work be converted to multiple smaller jobs? Progress within the writer can be reported to an atomic counter. When a reader has a waiting budget, it checks the atomic value and, if progress looks slow, it can use the same atomic value to instantly push it to 100% progress; the writer sees that and quits the lock early.

Related

Reader/Writer: multiple heavy readers, only 1 write per day

I have a huge tbb::concurrent_unordered_map that gets "read" heavily by multiple (~60) threads concurrently.
Once per day I need to clear it (either fully, or selectively). Erasing is obviously not thread safe in tbb implementation, so some synchronisation needs to be in place to prevent UB.
However I know for a fact that the "write" will only happen once per day (the exact time is unknown).
I've been looking at std::shared_mutex to allow concurrent reads, but I am afraid that even in an uncontended scenario it might slow things significantly.
Is there a better solution for this?
Perhaps checking a std::atomic<bool> before locking on the mutex?
Thanks in advance.
It might require a bit of extra work on maintaining it, but you can use copy-on-write scheme.
Keep the map in a singleton within a shared pointer.
For "read" operations, have users thread-safely copy the shared pointer and use it for as long as they want.
For "write" operations, create a new map instance in a new shared pointer, fill it with whatever you want, and replace the old version with it in the singleton.
This way "read" users will still see the old version and can use it safely. Just make sure they occasionally get the newest version from the singleton. Perhaps, even give them a handle that automatically updates the shared pointer once a second or something.
This works in case you don't need the threads to synchronously update all at once.
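A minimal sketch of the copy-on-write scheme, assuming a plain std::unordered_map stands in for the real map and a small mutex guards only the pointer itself. MapHolder, snapshot and replace are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

using Map = std::unordered_map<std::string, int>;

// Hypothetical holder for the current map version.
class MapHolder {
public:
    MapHolder() : current_(std::make_shared<const Map>()) {}

    // Readers copy the shared pointer under a very short lock, then use the
    // snapshot for as long as they like without any further locking.
    std::shared_ptr<const Map> snapshot() const {
        std::lock_guard<std::mutex> lock(m_);
        return current_;
    }

    // The writer builds the new map outside the lock and only swaps pointers
    // while holding it.
    void replace(Map next) {
        auto fresh = std::make_shared<const Map>(std::move(next));
        std::lock_guard<std::mutex> lock(m_);
        current_.swap(fresh);
        // 'fresh' now refers to the old map; it is released after the lock,
        // and the old map is destroyed once the last reader drops its snapshot.
    }

private:
    mutable std::mutex m_;
    std::shared_ptr<const Map> current_;
};
```

A reader that calls snapshot() once per batch of lookups, rather than once per lookup, pays the pointer-copy cost only rarely; how often to refresh is an application decision.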
Another scheme, you create atomic boolean to indicate when an update is incoming, and just make all threads pause their operations on the map when it is true. Once they all stopped you perform the update and let them resume their operation.
This is a perfect job for a read/write lock.
In C++ this can be implemented by having a shared_mutex, then using a unique_lock to lock it for writing, and a shared_lock to lock it for reading. See this post for example code.
The end effect is that readers will never block each other, so reads can all happen at the same time, but if the writer has the lock, everything else will block to let the writing operation proceed.
If the writing takes a long time, so long that the once-per-day delay is unacceptable, then you can have the writer create and populate a new copy of the data without taking a lock, then take the write end of the lock and swap the data:
Readers:
Lock mutex with a shared_lock
Do stuff
Unlock
Repeat
Writer:
Create new copy of data
Lock mutex with a unique_lock
Swap data quickly
Unlock
Repeat tomorrow
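The reader and writer steps above might look like this in C++17. This is a sketch only: a std::vector<int> stands in for the real data, and DailyTable, contains and replace_all are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <utility>
#include <vector>

// Hypothetical container protected by a read/write lock.
class DailyTable {
public:
    // Reader: lock with a shared_lock, do stuff, unlock on scope exit.
    bool contains(int key) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return std::find(data_.begin(), data_.end(), key) != data_.end();
    }

    // Writer: the caller builds 'fresh' without holding any lock; only the
    // swap happens under the unique_lock.
    void replace_all(std::vector<int> fresh) {
        {
            std::unique_lock<std::shared_mutex> lock(mutex_);
            data_.swap(fresh);  // quick exchange of internal buffers
        }
        // 'fresh' now holds the old contents and is freed after unlocking.
    }

private:
    mutable std::shared_mutex mutex_;
    std::vector<int> data_;
};
```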
A shared_lock on a shared_mutex will be fast. You could use a double check locking strategy but I would not do that until you do performance testing and also probably take a look at the source for shared_lock, because I suspect it does something similar already, and a double-check on the read end might just add overhead unnecessarily. Also I don't have enough coffee in me yet to work out double check locking in a read/write lock scenario.
There is a threading construct called a spin lock as well, but it's really just an encapsulation of a double-checked lock that repeats the "check" until it clears. It's a good construct but again you'll want performance analyses and a look at the shared_lock + shared_mutex source, because they might spin already. A good implementation of a spin lock can be found here, it covers some common gotchas. You might have to tweak it to get a read/write spinlock.
Generally speaking though, it's best to use existing constructs during the initial implementation at the very least as a clearly coded proof-of-concept. Then if you know that you're seeing too much read contention, you can optimize from there. But you need the general strategy down first, and 91 times out of a hundred, it's good enough. In this case, no matter what, some manifestation of a read/write lock is what you're going to end up with.

What does a read lock do in C++?

When using a shared_mutex, there is exclusive access and shared access. Exclusive access only allows one thread to access the resource while others are blocked until the thread holding the lock releases the lock. A shared access is when multiple threads are allowed to access the resource but under a "read lock". What is a "read lock"? I don't understand the meaning of a read lock. Can someone give code examples of a "read lock".
My guess of a read lock: I thought that a read lock only allows threads to run the code and not modify anything. Apparently I was wrong when I tried some code and it didn't work as I thought.
Again, can someone help me understand what a read lock is.
Thanks.
Your guess is very close to being right but there's a slight subtlety.
The read/write distinction shows intent of the locker, it does not limit what the locker actually does. If a read-locker decides to modify the protected resource, it can, and hilarity will ensue. However, that's no different to a thread modifying the protected resource with no lock at all. The programmer must follow the rules, or pay the price.
As to the distinction, a read lock is for when you only want to read the shared resource. A billion different threads can do this safely because the resource is not changing while they do it.
A write lock, on the other hand, means you want to change the shared resource, so there should be no readers (or other writers) with an active lock.
It's a way of optimising locks based on usage. For example, consider the following lock queue, with the earliest entries on the right, and no queue jumping (readers jumping in front of writers if the writer is currently blocked) allowed:
(queue here) -> RR W RRRR -> (action here)
The "resource controller" can allow all four R lockers at the queue head access at the same time.
The W locker will then have to wait until all of those locks are released before gaining access.
The final two R lockers must wait until the W locker releases but can then both get access concurrently.
In other words, the possible lock holders at any given time are (mutually exclusive):
any number of readers.
one writer.
As an aside, there are other possible optimisation strategies you can use as well.
For example (as alluded to earlier), you may want a reader to be able to jump the queue if it looks like the writer would have to wait a while anyway (such as there being a large number of current readers).
That's fraught with danger since you could starve a write locker, but it can be done provided there are limits on the number of queue jumpers.
I have done it in the past. From memory (a long ago memory of an embedded OS built in BCPL for the 6809 CPU, which shows more of my age than I probably want to), the writer included with its lock request the maximum number of queue jumpers it was happy with (defaulting to zero), and the queueing of future readers took that into account.
A write-lock only allows one lock to be acquired at a time.
A read-lock only allows other read-locks to be acquired at the same time, but prevents write-locks.
If you try to acquire a write-lock, you will be forced to wait until every other lock (read or write) has been released. Once you have the write-lock, no one else can get a lock of either kind.
If you try to acquire a read-lock, you will have to wait for any write-lock to be released. If there is no write-lock, you will obtain the read-lock, regardless how many other threads have a read-lock.
Whether or not you modify data is entirely up to you to police. If you have a read-lock, you should not be modifying shared data.
A read lock is just another way of saying that a thread of execution has acquired shared ownership of the mutex. It says nothing about what that thread can actually do.
It is up to the programmer not to update whatever shared resource is being protected by the mutex under a shared lock. To do that, acquire an exclusive lock.
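Since the question asked for code, here is a small sketch of what a "read lock" looks like with std::shared_mutex (C++17). The function probe_while_read_locked and the struct ProbeResult are hypothetical names for this demonstration: one thread holds a shared (read) lock while a second thread checks that it can also take a shared lock but cannot take the exclusive one. On some toolchains, programs using std::thread need to be built with thread support enabled (for example -pthread).

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <thread>

struct ProbeResult {
    bool second_reader_entered = false;
    bool writer_entered = false;
};

// Hypothetical demo: what can another thread do while we hold a read lock?
ProbeResult probe_while_read_locked() {
    std::shared_mutex m;
    ProbeResult r;

    // This is the "read lock": shared ownership of the mutex.
    std::shared_lock<std::shared_mutex> first_reader(m);

    std::thread other([&] {
        {
            // A second reader gets in even though the first still holds its lock.
            std::shared_lock<std::shared_mutex> second_reader(m);
            r.second_reader_entered = true;
        }
        // A writer cannot get in while any reader holds the lock.
        std::unique_lock<std::shared_mutex> writer(m, std::try_to_lock);
        r.writer_entered = writer.owns_lock();
    });
    other.join();
    return r;
}
```

Nothing in this code stops either reader from modifying shared data; the lock type only records intent, as the answers above point out.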

How to make windows slim read writer lock fair?

I found out that Windows implemented a slim reader-writer lock (see https://msdn.microsoft.com/en-us/library/windows/desktop/aa904937%28v=vs.85%29.aspx ). Unfortunately (for me) this rw-lock is neither fifo nor is it fair (in any sense).
Is there a possibility to make the windows rw-lock with some workaround fair or fifo?
If not, in which scenarios would you use the windows slim rw-lock?
It is unlikely you can change the slim lock itself to be fair, especially since the documentation doesn't indicate any method of doing so, and most locks today are unfair for performance reasons.
That said, it is fairly straightforward to roll your own approximately FIFO lock with Windows events, and a 64-bit control word that you manipulate with compare and swap that is still very slim. Here's an outline:
The state of the lock is reflected in the control word, which is manipulated atomically to transition between the states. This allows threads to enter the lock (if allowed) with a single atomic operation and no kernel switch (that's the performance part of "slim"). The events are used to notify waiting threads when threads need to block, and can be allocated on demand (that's the low memory footprint part of "slim").
The lock control word has the following states:
(1) Free - no readers or writers, and no waiters. Any thread can acquire the lock for reading or writing by atomically transitioning the lock into state (2) or (4).
(2) N readers in the lock. There are N readers in the lock at the moment. New readers can immediately acquire the lock by adding 1 to the count - use a field of 30 bits or so within the control word to represent this count. Writers must block (perhaps after spinning). When readers leave the lock, they decrement the count, which may transition to state (1) when the last reader leaves (although they don't need to do anything special in a (2) -> (1) transition).
(3) State (2) + waiting writers + 0 or more waiting readers. In this state, there are 1 or more readers still in the lock, but at least one waiting writer. The writers should wait on an auto-reset event, which is designed, although not guaranteed, to be FIFO. There is a field in the control word to indicate how many writers are waiting. In this state, new readers that want to enter the lock cannot, and set a reader-waiting bit instead, and block on the reader-waiting event. New writers increment the waiting writer count and block on the writer-waiting event. When the last reader leaves (setting the reader-count field to 0), it signals the writer-waiting event, releasing the longest-waiting writer to enter the lock.
(4) Writer in the lock. When a writer is in the lock, all readers queue up and wait on the reader-waiting event. All incoming writers increment the waiting-writer count and queue up as usual on the writer-waiting event. There may even be some waiting readers when the writer acquires the lock because of state (3) above, and these are treated identically. When the writer leaves the lock, it checks for waiting writers and readers and either unblocks a writer or all readers, depending on policy, discussed below.
All the state transitions discussed above are done atomically using compare-and-swap. The typical pattern is that any of the lock() or unlock() calls look at the control word, determine what state they are in and what transition needs to happen (following the rules above), calculate the new control word in a temporary then attempt to swap in the new control word with compare-and-swap. Sometimes that attempt fails because another thread concurrently modified the control word (e.g., another reader entered the lock, incrementing the reader count). No problem, just start over from "determine state..." and try again. Such races are rare in practice since the state word calculation is very short, and that's just how things work with CAS-based complex locks.
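The compare-and-swap pattern can be illustrated with the reader fast path only. This is a sketch under assumptions: the bit layout (30 bits of reader count, one writer-active bit, 30 bits of waiting-writer count) and all names are invented for this example, and the slow paths that would block on events are left out entirely.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical control-word layout (an assumption, not taken from any real lock):
constexpr std::uint64_t kReaderMask         = (1ull << 30) - 1;     // bits 0..29
constexpr std::uint64_t kWriterActive       = 1ull << 30;           // bit 30
constexpr std::uint64_t kWritersWaitingMask = 0x3FFFFFFFull << 31;  // bits 31..60

// Fast path for a reader entering. Returns false when the caller would have
// to take the slow path and block (writer active or writers waiting).
bool try_enter_reader(std::atomic<std::uint64_t>& word) {
    std::uint64_t old = word.load(std::memory_order_relaxed);
    for (;;) {
        // Determine the state from the current control word.
        if (old & (kWriterActive | kWritersWaitingMask)) return false;
        assert((old & kReaderMask) != kReaderMask);  // reader count must not overflow
        // Compute the new control word in a temporary.
        const std::uint64_t desired = old + 1;
        // Try to swap it in; on failure 'old' is refreshed and we start over.
        if (word.compare_exchange_weak(old, desired,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed)) {
            return true;
        }
    }
}

// Fast path for a reader leaving. Returns true when this was the last reader
// and at least one writer is waiting, i.e. the caller should wake a writer.
bool leave_reader(std::atomic<std::uint64_t>& word) {
    const std::uint64_t prev = word.fetch_sub(1, std::memory_order_release);
    return (prev & kReaderMask) == 1 && (prev & kWritersWaitingMask) != 0;
}
```

In the uncontended case both functions complete with a single atomic read-modify-write, which is the fast-path property the answer describes.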
This lock design is "slim" in almost every sense. Performance-wise, it is near the top of what you can get for a general purpose design. In particular, the common fast-paths of (a) reader entering the lock with 0 or more readers already in the lock, (b) reader leaving the lock with 0 or more readers still in the lock and (c) writer entering/leaving an uncontended lock are all about as fast as possible in the usual case: a single atomic operation. Furthermore, the reader entry and exit paths are "lock free" in the sense that incoming readers do not temporarily take a mutex internal to the rwlock, manipulate state, and then unlock it while entering/leaving the lock. That internal-mutex approach is slow and subject to issues whenever a reader thread is context-switched out at the critical moment when it holds the internal lock. Such approaches do not scale to heavier reader activity with a short rwlock critical section: even though multiple readers can, in theory, enter the critical section, they all bottleneck on entering and leaving the internal lock (which happens twice for every enter/exit operation) and performance is worse than a normal mutex!
It is also lightweight in that it only needs a couple of Windows Event objects, and these objects can be allocated on demand - they are only needed when contention occurs and a state transition that requires blocking is about to occur. That's how CRITICAL_SECTION objects work.
The lock above is fair in the sense that readers won't block writers, and writers are served in FIFO order. How writers interact with waiting readers is up to your policy for who to unblock when the lock becomes free after a writer unlocks and there are both waiting readers and writers. One simple policy is to unblock all waiting readers.
In this design, writers will alternate in FIFO order with FIFO batches of readers. Writers are FIFO relative to other writers, and reader batches are FIFO relative to other reader batches, but the relationship between writers and readers isn't exactly FIFO: because all incoming readers are added to the same reader-waiting set, in the case that there are already several waiting writers, arriving readers all go into the next "batch" to be released, which actually puts them ahead of writers that are already waiting. That's quite reasonable though: readers all go at once, so adding more readers to the batch doesn't necessarily cost much and probably increases efficiency, and if you did serve every thread in strict FIFO order, the lock would reduce in behavior to a simple mutex under contention.
Another possible design is to always unblock writers if any are waiting. This favors writers at the expense of readers and does mean that a never-ending stream of writers could block out readers indefinitely. This approach makes sense where you know your writes are latency-sensitive and important, and you either aren't worried about reader starvation, or you know it can't occur due to the design of your application (e.g., because there is only one possible writer at a time).
Beyond that, there are a lot of other policies possible, such as favoring writers up until readers have been waiting for a certain period, or limiting reader batch sizes, or whatever. They are mostly possible to implement efficiently since the bookkeeping is generally limited to the slow paths where threads will block anyway.
I've glossed over some implementation details and gotchas here (in particular, the need to be careful when making the transitions that involve blocking to avoid "missed wakeup" issues) - but this definitely works. I've written such a lock before the slim rwlock existed to fill the need for a fast high-performance rwlock, and it performs very well. Other tradeoffs are possible too, e.g., for designs in which reads are expected to dominate, contention can be reduced by splitting up the control word across cache lines, at the cost of more expensive write operations.
One final note - this lock is a bit fatter, in memory use, than the Windows one in the case that it is contended - because it allocates one or two Windows events per lock, while the slim lock avoids this. The slim lock likely does it by directly supporting the slim lock behavior in the kernel, so the control word can directly be used as part of the kernel-side waitlist. You can't reproduce that exactly, but you can still remove the per-lock overhead in another way: use thread-local storage to allocate your two events per thread rather than per lock. Since a thread can only be waiting on one lock at a time, you only need this structure once per thread. That brings it into line with the slim lock in memory use (unless you have very few locks and a ton of threads).
this rw-lock is neither fifo nor is it fair
I wouldn't expect anything to do with threading to be "fair" or "fifo" unless it said it was explicitly. In this case, I would expect writing locks to take priority, as it might never be able to obtain a lock if there are a lot of reading threads, but then I also wouldn't assume that's the case, and I've not read the MS documentation for a better understanding.
That said, it sounds like your issue is that you have a lot of contention on the lock, caused by write threads; because otherwise your readers would always be able to share the lock. You should probably examine why your writing thread(s) are trying to hold the lock for so long; buffering up adds for example will help mitigate this.

Thread read versus write locking

I'm writing a CPU intensive program in C++ that has several threads needing to access a shared data structure, so locking will be required. To maximize throughput, I want to keep the bottleneck to a minimum. It looks like maybe nine times out of ten it will only be necessary to read the data structure, and one time out of ten it will be necessary to modify it.
Is there a way to have threads take read or write locks, so that write locks block everything but read locks don't block each other?
A portable solution would be ideal, but if there is one solution for Windows and another for Linux that would be okay.
Yes, this is a common situation that can be solved with a reader-writer lock.
Note that depending on the dynamic properties of your program, you may need to be careful about writer starvation. If there are enough readers that their attempts to read always overlap (or overlap for a long time), then a simple implementation of a reader-writer lock will "starve" the writer by making the writer wait until there are no readers reading. In a more advanced implementation, a writer request will be conceptually inserted into the queue before subsequent readers, allowing the writer to have a chance to access after all the previously active readers finish.
Most implementations require you to know ahead of time whether you want a read lock or a write lock. Some implementations allow you to "upgrade" a read lock into a write lock without having to release the read lock first (which would give another writer an opportunity to enter the lock).

Fast thread synchronization

I'm working on a system that has multiple threads and one shared object. There are a number of threads that do read operations very often, but write operations are rare, maybe 3 to 5 per day.
I'm using an rwlock for synchronization, but the lock acquisition is not fast enough since it happens all the time. So, I'm looking for a faster way of doing it.
Maybe a way of making the write function atomic, or locking all threads during the write. Portability is not a hard requirement; I'm using Linux with GCC 4.6.
Have you considered using read-copy-update with liburcu? This lets you avoid atomic operations and locking entirely on the read path, at the expense of making writes quite a bit slower. Note that some readers might see stale data for a short time, though; if you need the update to take effect immediately, it may not be the best option for you.
You might want to use multiple objects rather than a single one. Instead of actually sharing the object, create an object that holds the object and an atomic count, then share a pointer to this structure among the threads.
[Assuming that there is only one writer] Each reader will get the pointer, then atomically increment the counter and use the object, after reading, atomically decrement the counter. The writer will create a new object that holds a copy of the original and modify it. Then perform an atomic swap of the two pointers. Now the problem is releasing the old object, which is why you need the count of the readers. The writer needs to continue checking the count of the old object until all readers have completed the work at which point you can delete the old object.
If there are multiple writers (i.e. there can be more than one thread updating the variable) you can follow the same approach, but writers would need to do a compare-and-swap exchange of the pointer. If the pointer from which the updated copy was made has changed, then the writer restarts the process (deletes its new object, copies again from the pointer and retries the CAS).
Maybe you could use a spinlock; the threads will busy-wait until it is unlocked. If the lock isn't held for long, it can be much more efficient than a mutex, since locking and unlocking complete with fewer instructions.
spinlock is a part of POSIX pthread although optional so I don't know if it's implemented on your system. I used them in a C program on ubuntu but had to compile with -std=gnu99 instead of c99.
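For comparison, a spinlock can also be written portably with std::atomic_flag from C++11 instead of the POSIX pthread_spin_* functions mentioned above. This is a minimal sketch, and the class name SpinLock is just a label for the example; it is a plain exclusive lock, not a reader/writer one, and it burns CPU while waiting, so it only pays off when the lock is held very briefly.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Minimal test-and-set spinlock; usable with std::lock_guard.
class SpinLock {
public:
    void lock() {
        // Busy-wait until the flag was previously clear.
        while (flag_.test_and_set(std::memory_order_acquire)) {
        }
    }

    bool try_lock() {
        // Succeeds only if the flag was clear before this call.
        return !flag_.test_and_set(std::memory_order_acquire);
    }

    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};
```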