Alternatives to locks for synchronisation - C++

I'm currently developing my own little threading library, mainly for learning purposes, and have reached the message queue part, which will involve a lot of synchronisation in various places. Previously I've mainly used locks, mutexes and condition variables, which are all variations on the same theme: a lock for a section that should only be used by one thread at a time.
Are there any solutions to synchronisation other than using locks? I've read about lock-free synchronization in places, but some consider hiding the locks inside containers to be lock-free, which I disagree with; you just don't use the locks explicitly yourself.

Lock-free algorithms typically involve using compare-and-swap (CAS) or similar CPU instructions that update some value in memory not only atomically, but also conditionally and with an indicator of success. That way you can code something like this:
1 do
2 {
3 current_value = the_variable
4 new_value = ...some expression using current_value...
5 } while(!compare_and_swap(the_variable, current_value, new_value));
compare_and_swap() atomically checks whether the_variable's value is still current_value, and only if that's so will it update the_variable's value to new_value and return true
exact calling syntax will vary with the CPU, and may involve assembly language or system/compiler-provided wrapper functions (use the latter if available - they may also restrict other compiler optimisations or issues to safe behaviours); generally, check your docs
The significance is that when another thread updates the variable after the read on line 3 but before the CAS on line 5 attempts the update, the compare and swap instruction will fail because the state from which you're updating is not the one you used to calculate the desired target state. Such do/while loops can be said to "spin" rather than lock, as they go round and round the loop until CAS succeeds.
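In C++11 and later this maps directly onto std::atomic; a minimal sketch of the same loop (the variable and the update expression are just placeholders):

    #include <atomic>

    std::atomic<int> the_variable{0};

    void multiply_by(int factor)
    {
        int current_value = the_variable.load();
        int new_value;
        do {
            new_value = current_value * factor;   // ...some expression using current_value...
            // On failure, compare_exchange_weak reloads the_variable into current_value,
            // so the next iteration recomputes new_value from the fresh state.
        } while (!the_variable.compare_exchange_weak(current_value, new_value));
    }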
Crucially, your existing threading library can be expected to have a two-stage locking approach for mutexes, read-write locks etc., involving:
First stage: spinning using CAS or similar (i.e. spin on { read the current value; if it's not set, then cas(current = not set, new = set) }), which means other threads doing a quick update often won't force your thread to be swapped out to wait, with all the relatively time-consuming overheads that involves.
The second stage is only used if some limit of loop iterations or elapsed time is exceeded: it asks the operating system to queue the thread until it knows (or at least suspects) the lock is free to acquire.
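A rough sketch of that two-stage idea (the spin limit is arbitrary, and a real implementation would park the thread on a futex or similar OS primitive rather than yielding in a loop):

    #include <atomic>
    #include <thread>

    class hybrid_lock {
        std::atomic<bool> locked{false};
    public:
        void lock()
        {
            // Stage 1: spin a bounded number of times, hoping the holder releases quickly.
            for (int i = 0; i < 4000; ++i) {
                if (!locked.load(std::memory_order_relaxed)) {   // cheap read first
                    bool expected = false;
                    if (locked.compare_exchange_weak(expected, true))
                        return;
                }
            }
            // Stage 2: give up the CPU between attempts so the OS can run other threads.
            // (Real mutexes use a futex/park mechanism here instead of yielding in a loop.)
            bool expected = false;
            while (!locked.compare_exchange_weak(expected, true)) {
                expected = false;
                std::this_thread::yield();
            }
        }
        void unlock() { locked.store(false); }
    };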
The implication of this is that if you're using a mutex to protect access to a variable, then you are unlikely to do any better by implementing your own CAS-based "mutex" to protect the same variable.
Lock-free algorithms come into their own when you are working directly on a variable that's small enough to update with the CAS instruction itself. Instead of being...
get a mutex (by spinning on CAS, falling back on slower OS queue)
update variable
release mutex
...they're simplified (and made faster) by simply having the spin on CAS do the variable update directly. Of course, you may find the work to calculate new value from old painful to repeat speculatively, but unless there's a LOT of contention you're not wasting that effort often.
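For a simple counter, the contrast looks roughly like this (illustrative only):

    #include <atomic>
    #include <mutex>

    // Mutex version: acquire, update, release.
    std::mutex m;
    long counter_locked = 0;
    void increment_locked() { std::lock_guard<std::mutex> g(m); ++counter_locked; }

    // Lock-free version: the atomic read-modify-write *is* the update.
    std::atomic<long> counter_atomic{0};
    void increment_atomic() { counter_atomic.fetch_add(1); }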
This ability to update only a single location in memory has far-reaching implications, and work-arounds can require some creativity. For example, if you had a container using lock-free algorithms, you may decide to calculate a potential change to an element in the container, but can't sync that with updating a size variable elsewhere in memory. You may need to live without size, or be able to use an approximate size where you do a CAS-spin to increment or decrement the size later, but any given read of size may be slightly wrong. You may need to merge two logically-related data structures - such as a free list and the element-container - to share an index, then bit-pack the core fields for each into the same atomically-sized word at the start of each record. These kinds of data optimisations can be very invasive, and sometimes won't get you the behavioural characteristics you'd like. Mutexes et al are much easier in this regard, and at least you know you won't need a rewrite to mutexes if requirements evolve just that step too far. That said, clever use of a lock-free approach really can be adequate for a lot of needs, and yield a very gratifying performance and scalability improvement.
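As a toy illustration of the bit-packing idea, two logically related 32-bit fields (say a free-list head index and a change counter) can share one 64-bit word so that a single CAS updates both together (the layout here is made up for illustration):

    #include <atomic>
    #include <cstdint>

    // High 32 bits: free-list head index; low 32 bits: change counter/version.
    std::atomic<std::uint64_t> packed{0};

    void set_free_list_head(std::uint32_t new_head)
    {
        std::uint64_t old_word = packed.load();
        std::uint64_t new_word;
        do {
            std::uint32_t version = static_cast<std::uint32_t>(old_word) + 1;  // bump version
            new_word = (static_cast<std::uint64_t>(new_head) << 32) | version;
        } while (!packed.compare_exchange_weak(old_word, new_word));  // both fields change atomically
    }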
A core (good) consequence of lock-free algorithms is that one thread can't be holding a mutex and then happen to get swapped out by the scheduler, such that other threads can't work until it resumes; rather, with CAS, they can spin safely and efficiently without needing an OS fallback.
Things that lock-free algorithms can be good for include updating usage/reference counters, modifying pointers to cleanly switch the pointed-to data, free lists, linked lists, marking hash-table buckets used/unused, and load balancing. Many others, of course.
As you say, simply hiding the use of mutexes behind some API is not lock-free.

There are a lot of different approaches to synchronization: various forms of message passing (for example, CSP), or transactional memory.
Both of these may be implemented using locks, but that's an implementation detail.
And then of course, for some purposes, there are lock-free algorithms or data-structures, which make do with just a few atomic instructions (such as compare-and-swap), but this isn't really a general-purpose replacement for locks.

Some data structures can be implemented in a lock-free manner. For example, the producer/consumer pattern can often be implemented using lock-free linked-list structures.
However, most lock-free solutions require significant thought on the part of the person designing the specific program/specific problem domain. They aren't generally applicable for all problems. For examples of such implementations, take a look at Intel's Threading Building Blocks library.
Most important to note is that no lock-free solution is free. You're going to give something up to make that work, at the bare minimum in implementation complexity, and probably performance in scenarios where you're running on a single core (for example, a linked list is MUCH slower than a vector). Make sure you benchmark before using lock free on the base assumption that it would be faster.
Side note: I really hope you're not using condition variables, because there's no way to ensure that their access operates as you wish in C and C++.

Yet another library to add to your reading list: Fast Flow
What's interesting in your case is that it is based on lock-free queues. They have implemented a simple lock-free queue and then built more complex queues on top of it.
And since the code is free, you can peruse it and get the code for the lock-free queue, which is far from trivial to get right.
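For a flavour of what such a queue involves, here is a minimal single-producer/single-consumer ring buffer in the same spirit (a simplified sketch, not Fast Flow's actual code):

    #include <atomic>
    #include <cstddef>

    template <typename T, std::size_t Capacity>
    class spsc_queue {
        T buffer[Capacity];                  // one slot is always left empty to tell full from empty
        std::atomic<std::size_t> head{0};    // next slot to read  (advanced only by the consumer)
        std::atomic<std::size_t> tail{0};    // next slot to write (advanced only by the producer)
    public:
        bool push(const T& value)            // called by the single producer only
        {
            std::size_t t = tail.load(std::memory_order_relaxed);
            std::size_t next = (t + 1) % Capacity;
            if (next == head.load(std::memory_order_acquire))
                return false;                // full
            buffer[t] = value;
            tail.store(next, std::memory_order_release);   // publish the element
            return true;
        }
        bool pop(T& value)                   // called by the single consumer only
        {
            std::size_t h = head.load(std::memory_order_relaxed);
            if (h == tail.load(std::memory_order_acquire))
                return false;                // empty
            value = buffer[h];
            head.store((h + 1) % Capacity, std::memory_order_release);
            return true;
        }
    };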


When is std::shared_timed_mutex slower than std::mutex and when (not) to use it?

I am trying to implement a multithreaded LRU cache in C++ using this article as a hint or inspiration. It is for Go, but the concepts required more or less exist in C++ too. This article proposes to use fine-grained locking with shared mutexes around a hash table and a linked list.
So I intended to write a cache using std::unordered_map, std::list and locking with std::shared_timed_mutex. My use case includes several threads (4-8) using this cache as a storage for misspelled words and corresponding possible corrections. The size of the cache would be around 10000-100000 items.
But I've read in several places that it rarely makes sense to use a shared mutex instead of a plain one and that it's slower, though I couldn't find any real benchmarks with numbers, or even vague guidelines, on when to use and when not to use a shared mutex. Other sources propose using a shared mutex any time concurrent readers more or less outnumber concurrent writers.
When is it better to use an std::shared_timed_mutex than a plain std::mutex? How many times should readers/reads outnumber writers/writes? Of course I get that it depends on many factors, but how should I make a decision which one to use?
Maybe it's platform-dependent and some platform implementations are worse than others? (we use Linux and Windows as targets, MSVC 2017 and GCC 5)
Does it make sense to implement cache locking as described in the article?
Does std::shared_mutex (from C++17) make any difference in performance compared to a timed one?
P.S. I expect the answer will be "measure/profile first to see what fits your case best". I would, but I need to implement one version first, and it would be great if there were some heuristics for choosing instead of implementing both options and measuring. Also, even if I measure, the outcome will depend on the data I use, and real data can be hard to predict (e.g. for a server in the cloud).
When is it better to use an std::shared_timed_mutex than a plain std::mutex?
How many times should readers/reads outnumber writers/writes? Of course I get that it depends on many factors, but how should I make a decision which one to use?
Because of their extra complexity, cases where read/writer locks (std::shared_mutex, std::shared_timed_mutex) are superior to a plain lock (std::mutex, std::timed_mutex) are rare. They do exist, but personally I have never encountered one myself.
A read/writer mutex will not improve performance if you have frequent but short read operations. It is better suited to scenarios where read operations are frequent and expensive. When the read operation is only a lookup in an in-memory data structure, a simple lock will most likely outperform the read/writer solution.
If the read operations are very costly and you can process many in parallel, increasing the read vs write ratio should at some point lead to a situation where read/writer will outperform an exclusive lock. Where that breaking point is depends on the real workload. I am not aware of a good rule of thumb.
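To make the two usage patterns concrete, here is a minimal sketch using std::shared_timed_mutex (the map and function names are placeholders, not from the article); whether the reader side actually wins depends entirely on how long the lock is held and how many readers run concurrently:

    #include <mutex>
    #include <shared_mutex>
    #include <string>
    #include <unordered_map>

    std::shared_timed_mutex cache_mutex;
    std::unordered_map<std::string, std::string> cache;

    bool lookup(const std::string& key, std::string& out)        // many concurrent readers allowed
    {
        std::shared_lock<std::shared_timed_mutex> lock(cache_mutex);
        auto it = cache.find(key);
        if (it == cache.end()) return false;
        out = it->second;
        return true;
    }

    void insert(const std::string& key, const std::string& value) // writers are exclusive
    {
        std::unique_lock<std::shared_timed_mutex> lock(cache_mutex);
        cache[key] = value;
    }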
Also note that performing expensive operations while holding a lock is often a bad sign. There may be better ways to solve the problem than using a read/writer lock.
Two comments on the topic from people that have far more knowledge in that field than myself:
Howard Hinnant's answer to C++14 shared_timed_mutex VS C++11 mutex
Anthony Williams' quote can be found at the end of this answer (unfortunately, the link to this original post seems to be dead). He explains why read/writer locks are slow, and are often not the ideal solution.
Maybe it's platform-dependent and some platform implementations are worse than others? (we use Linux and Windows as targets, MSVC 2017 and GCC 5)
I am not aware of significant differences between operating systems. My expectation would be that the situation will be similar. On Linux, the GCC library relies on the read/writer lock implementation of glibc. If you want to dig in, you can find the implementation in pthread_rwlock_common.c. It also illustrates the extra complexity that comes with read/writer locks.
There is an old issue for the shared_mutex implementation in Boost (#11798 - Implementation of boost::shared_mutex on POSIX is suboptimal). But it is not clear to me if the implementation can be improved, or if it is only an example that is not well suited for read/writer locks.
Does it make sense to implement cache locking as described in the article?
Frankly, I am skeptical that a read/writer lock will improve performance in such a data structure. The reader operations should be extremely fast, as they are only lookups. Updating the LRU list also happens outside the read operations (in the Go implementation).
One implementation detail: using linked lists is not a bad idea here, because it makes the update operations extremely fast (you just update pointers). When using std::list, keep in mind that it normally involves memory allocations, which you should avoid while you hold the lock. It is better to allocate the memory before you acquire the lock, as memory allocations are expensive.
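As a sketch of that detail, a lookup that refreshes the LRU order can relink the existing node with std::list::splice, so nothing is allocated while the lock is held (types and names here are illustrative, not taken from any of the implementations mentioned):

    #include <list>
    #include <mutex>
    #include <string>
    #include <unordered_map>
    #include <utility>

    struct lru_cache {
        std::mutex m;
        std::list<std::pair<std::string, std::string>> order;   // front = most recently used
        std::unordered_map<std::string,
            std::list<std::pair<std::string, std::string>>::iterator> index;

        bool get(const std::string& key, std::string& out)
        {
            std::lock_guard<std::mutex> lock(m);
            auto it = index.find(key);
            if (it == index.end()) return false;
            order.splice(order.begin(), order, it->second);      // relink node, no allocation
            out = it->second->second;
            return true;
        }
    };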
In their HHVM project, Facebook has C++ implementations of concurrent LRU caches that look promising:
ConcurrentLRUCache
ConcurrentScalableCache
The ConcurrentLRUCache also uses a linked list (but not std::list) for the LRU list, and tbb::concurrent_hash_map for the map itself (a concurrent hash map implementation from Intel). Note that for locking of the LRU list updates, they did not go for the read/writer approach as in the Go implementation but use a simple std::mutex exclusive lock.
The second implementation (ConcurrentScalableCache) builds on top of the ConcurrentLRUCache. They use sharding to improve scalability. The drawback is that the LRU property is only approximated (depending on how many shards you use). In some workloads that might reduce the cache hit rate, but it is a nice trick to avoid all operations having to share the same lock.
Does std::shared_mutex (from C++17) make any difference in performance compared to a timed one?
I do not have benchmark numbers about the overhead, but it looks like comparing apples and oranges. If you need the timing feature, you have no real choice but to use std::shared_timed_mutex. But if you do not need it, you can simply use std::shared_mutex, which has to do less work and thus should never be slower.
I would not expect the timing overhead to be too serious in typical scenarios where you need timeouts, as locks tend to be held longer in those cases anyway. But as I said, I cannot back that statement up with real measurements.
So, which problems can actually be solved by std::shared_mutex?
Let's imagine you are writing some real-time audio software. You have a callback which is called by the driver 1000 times per second, and you have to put 1 ms of audio data into its buffer so the hardware can play it in the next 1 ms. You also have a "big" buffer of audio data (say 10 seconds) which is rendered by some other thread in the background and written once every 10 seconds. And you have 10 more threads which want to read data from the same buffer (to draw something on the UI, send it over the network, control external lights and so on). These are real tasks of real DJ software, not a joke.
So, at every callback call (every 1 ms) the chance of a conflict with the writer thread is very low (0.01%), but the chance of a conflict with another reader thread is nearly 100% - they work all the time and read the same buffer! Now suppose some thread which reads data from this buffer locks a std::mutex, decides to send something over the network, and then waits 500 ms for the response - you'll be locked out, unable to do anything; your hardware will not get its next portion of sound and will play silence (imagine this during a concert, for example). This is a disaster.
But here is the solution - use std::shared_mutex and std::shared_lock for all reader threads. Yes, the average acquisition of a std::shared_lock will cost you more (say 100 nanoseconds instead of 50 - still very cheap even for your real-time app, which has to write the buffer within 1 ms), but you are 100% safe from the worst case, where another reader thread blocks your performance-critical thread for 500 ms.
And that's the reason to use std::shared_mutex - to avoid or improve worst cases, not to improve average performance (that should be achieved in other ways).

Can queries to boost's rtree be done from parallel threads?

I have two modules in the application. Module1 owns and builds a boost::geometry::index::rtree. Module2 makes queries to Module1, which are passed to the RTree. Now I want to speed things up by having several Module2 instances, which make queries to one Module1 instance and work separately. I am 100% sure that the RTree does not change while any Module2 is working.
I've found this question: Can I use Boost.Geometry.index.rtree with threads?, but it describes a more complicated case, where the rtree is modified and queried from different threads. And the answer there is ambiguous: the answer itself states "No boost Rtree is not thread-safe in any way", but a comment states "It is safe to do queries, and it even possible to create workaround for creation". What is the right answer? Are there any resources, other than asking the Boost authors directly, to find out?
Tl;dr:
Is it safe to make queries to boost::geometry::index::rtree from different threads, if I am 100% sure that no thread modifies the RTree?
The answer to the linked question says "No boost Rtree is not thread-safe in any way", but the comments say "It is safe to do queries, and it even possible to create workaround for creation". Who is right?
There is no contradiction. Adam is the author. Everyone is right. Note that the answer also said
You /can/ run multiple read-only operations in parallel. Usually, library containers are safe to use from multiple threads for read-only operations (although you might want to do a quick scan for any mutable members hidden in the implementation).
In general, as long as the bitwise representation doesn't mutate, everything is safe for concurrent access. This is regardless of library support.
Note that you don't need that "quick scan" as it happens, because of the authoritative comment by Adam Wulkiewicz.
Footnote: that still doesn't make the library thread safe. That is simply true because the memory model of C++ is free of data races with bitwise constant data.
This doesn't seem to be the full question. What I'm reading is in two parts. The first part should be "I want to optimise my program. How should I go about doing that?"
You should use a profiler to take measurements before the optimisation! You might notice in the process that there are more significant optimisations available to you, and those might be pushed out of the window of possibility if you introduce multithreading prematurely.
You should use a profiler to take measurements after the optimisation! It's not uncommon for an optimisation to be found to be insignificant. In terms of multithreading optimisations, from your measurements you should see that processing one task takes slightly longer but that you can process between four and eight at once on a computer that has a four core CPU. If the slightly longer equates to a factor of 4-8x, then obviously multithreading is an unnecessary introduction of bloat and not an optimisation.
The second part, you have provided, in the form of these two statements:
I am 100% sure, that while any Module2 working RTree does not change.
Is it safe to make queries to boost::geometry::index::rtree from different threads, if I am 100% sure, that no thread modifies RTree?
You should use locks. If you don't, you'll be invoking undefined behaviour. I'll explain why you should use locks later.
I would recommend using a read/write lock (e.g. pthread_rwlock_t) for the usecase you have described. This will allow your threads to access the resource simultaneously so long as no threads are attempting to write, and provide a fence for updates to be pushed to threads.
Why should you use locks? First and foremost, they guarantee that your code will function correctly; any concerns regarding whether it's safe become invalid. Secondly, a lock provides a fence at which updates can be pushed to the thread; any concerns regarding the performance implications should be negligible when compared to the amount of gain you should see from this.
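A minimal sketch of that usage with pthread_rwlock_t (error handling omitted, and the actual rtree calls are elided - only the locking pattern is shown):

    #include <pthread.h>

    pthread_rwlock_t rtree_lock = PTHREAD_RWLOCK_INITIALIZER;

    void run_query(/* query parameters */)
    {
        pthread_rwlock_rdlock(&rtree_lock);   // many Module2 readers may hold this at once
        // ... rtree.query(...) ...
        pthread_rwlock_unlock(&rtree_lock);
    }

    void rebuild_rtree(/* new data */)
    {
        pthread_rwlock_wrlock(&rtree_lock);   // exclusive: no queries while Module1 rebuilds
        // ... rebuild or modify the rtree ...
        pthread_rwlock_unlock(&rtree_lock);
    }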
You should perform more than one task with each thread! This is why a fence is important. If your threads end up terminating and you end up creating new ones later on, you are incurring an overhead which is of course undesirable when performing an optimisation. If a thread terminates despite more of these tasks foreseen later, then that thread probably should have been suspended instead.
Expect that your optimisation might turn into a work-stealing thread pool. That is the nature of optimisations, when we're targeting the most significant one. Occasionally it is the most significant by far or perhaps the only bottleneck, after all. Optimising such bottlenecks might require extreme measures.
I emphasized "should be negligible" earlier because you're only likely to see a significant improvement in performance up to a point; it should make sense that attempting to fire up 10000 threads (each occupying between 0.5 and 4.0MB stack space for a total of 5-40GB) on a processor that has 4 cores (2500 threads per core) is not going to be very optimal. Nonetheless, this is where many people go wrong, and if they have a profiler guiding them they'll be more likely to notice...
You might even get away with running multiple tasks on one thread, if your tasks involve IO that can be made non-blocking. That's usually an optimisation I'll look into before I look at multithreading, as the profiler will highlight.

C++ threading vs. visibility issues - what's the common engineering practice?

From my studies I know the concepts of starvation, deadlock, fairness and other concurrency issues. However, theory differs from practice, to an extent, and real engineering tasks often involve greater detail than academic blah blah...
As a C++ developer I've been concerned about threading issues for a while...
Suppose you have a shared variable x which refers to some larger portion of the program's memory. The variable is shared between two threads A and B.
Now, if we consider read/write operations on x from both A and B threads, possibly at the same time, there is a need to synchronize those operations, right? So the access to x needs some form of synchronization which can be achieved for example by using mutexes.
Now lets consider another scenario where x is initially written by thread A, then passed to thread B (somehow) and that thread only reads x. The thread B then produces a response to x called y and passes it back to the thread A (again, somehow). My question is: what synchronization primitives should I use to make this scenario thread-safe. I've read about atomics and, more importantly, memory fences - are these the tools I should rely on?
This is not a typical scenario in which there is a "critical section". Instead, some data is passed between threads with no possibility of concurrent writes to the same memory location. So, after being written, the data should first be "flushed" somehow, so that the other thread can see it in a valid and consistent state before reading. What is this called in the literature - is it "visibility"?
What about pthread_once and its Boost/std counterpart, i.e. call_once? Does it help if both x and y are passed between threads through a sort of "message queue" which is accessed by means of the "once" functionality? AFAIK it serves as a sort of memory fence, but I couldn't find any confirmation of this.
What about CPU caches and their coherency? What should I know about that from the engineering point of view? Does such knowledge help in the scenario mentioned above, or any other scenario commonly encountered in C++ development?
I know I might be mixing a lot of topics, but I'd like to better understand what the common engineering practice is so that I can reuse already-known patterns.
This question is primarily related to the situation in C++03 as this is my daily environment at work. Since my project mainly involves Linux then I may only use pthreads and Boost, including Boost.Atomic. But I'm also interested if anything concerning such matters has changed with the advent of C++11.
I know the question is abstract and not that precise but any input could be useful.
you have a shared variable x
That's where you've gone wrong. Threading is MUCH easier if you hand off ownership of work items using some sort of threadsafe consumer-producer queue, and from the perspective of the rest of the program, including all the business logic, nothing is shared.
Message passing also helps prevent cache collisions (because there is no true sharing -- except of the producer-consumer queue itself, and that has a trivial effect on performance if the unit of work is large -- and organizing the data into messages helps reduce false sharing).
Parallelism scales best when you separate the problem into subproblems. Small subproblems are also much easier to reason about.
You seem to already be thinking along these lines, but no, threading primitives like atomics, mutexes, and fences are not very good for applications using message passing. Find a real queue implementation (queue, circular ring, Disruptor, they go under different names but all meet the same need). The primitives will be used inside the queue implementation, but never by application code.
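For illustration, a minimal producer-consumer queue of that kind might look like the sketch below (shown with C++11 primitives for brevity; Boost.Thread offers equivalents for C++03). The mutex and condition variable live entirely inside the queue, and application code only calls push() and pop():

    #include <condition_variable>
    #include <mutex>
    #include <queue>

    template <typename T>
    class work_queue {
        std::queue<T> items;
        std::mutex m;
        std::condition_variable not_empty;
    public:
        void push(T item)                    // producer hands off ownership of the work item
        {
            {
                std::lock_guard<std::mutex> lock(m);
                items.push(std::move(item));
            }
            not_empty.notify_one();
        }
        T pop()                              // consumer blocks until an item is available
        {
            std::unique_lock<std::mutex> lock(m);
            not_empty.wait(lock, [this] { return !items.empty(); });
            T item = std::move(items.front());
            items.pop();
            return item;
        }
    };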

Lock-free data structures in C++ = just use atomics and memory-ordering?

I used to see the term "lock-free data structure" and think "ooooo that must be really complex". However, I have been reading "C++ Concurrency in Action", and it seems that to write a lock-free data structure all you do is stop using mutexes/locks and replace them with atomic code (along with possible memory-ordering barriers).
So my question is- am I missing something here? Is it really that much simpler due to C++11? Is writing a lock-free data structure just a case of replacing the locks with atomic operations?
Ooooo but that is really complex.
If you don't see the difference between a mutex and an atomic access, there is something wrong with the way you look at parallel processing, and there will soon be something wrong with the code you write.
Most likely it will run slower than the equivalent blocking version, and if you (or rather your coworkers) are really unlucky, it will spout the occasional inconsistent data and crash randomly.
Even more likely, it will propagate real-time constraints to large parts of your application, forcing your coworkers to waste a sizeable amount of their time coping with arbitrary requirements they and their software would have quite happily lived without, and to resort to various superstitious "good practices" to obfuscate their code into submission.
Oh well, as long as the template guys and the wait-free guys had their little fun...
Parallel processing, be it blocking or supposedly wait-free, is inherently resource-consuming, complex and costly to implement. Designing a software architecture that takes real advantage of non-trivial parallel processing is a job for specialists.
A good software design should on the contrary limit the parallelism to the bare minimum, leaving most of the programmers free to implement linear, sequential code.
As for C++, I find this whole philosophy of wrapping indifferently a string, a thread and a coffee machine in the same syntactic goo a disastrous design choice.
C++ is allowing you to create a multiprocessor synchronization object out of about anything, like you would allocate a mere string, which is akin to presenting an assault rifle next to a squirt gun in the same display case.
No doubt a lot of people are making a living by selling the idea that an assault rifle and a squirt gun are, after all, not so different. But still, they are.
Two things to consider:
Only a single operation is atomic when using C++11 atomics, but often you want a mutex to protect a larger region of code (see the sketch after this list).
If you use std::atomic with a type that the compiler cannot map to an atomic operation in machine code, the implementation will have to fall back on a lock for that operation (you can check with is_lock_free()).
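To illustrate the first point: each statement below is individually atomic, yet the combination is not - which is exactly the gap a mutex (or a carefully written CAS loop) has to close. A toy sketch:

    #include <atomic>

    std::atomic<int> balance{100};

    // Each load/store is atomic on its own, but another thread can change balance
    // between the check and the store, so the check-then-act race is silently lost.
    void withdraw_broken(int amount)
    {
        if (balance.load() >= amount)                     // atomic read
            balance.store(balance.load() - amount);       // atomic write, but too late
    }

    // The lock-free fix: re-check and retry with a CAS loop.
    void withdraw_correct(int amount)
    {
        int current = balance.load();
        while (current >= amount &&
               !balance.compare_exchange_weak(current, current - amount))
        {
            // current was refreshed by the failed CAS; loop re-checks and retries
        }
    }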
Overall you probably want to stick with mutexes and only use lock-free code for performance-critical sections, or if you are implementing your own structures to use for synchronization.
You are missing something. While lock-free data structures do use the primitives you mention, simply invoking their existence will not provide you with a lock-free queue.
Lock-free code is not simpler because of C++; without C++, operating systems often provide similar facilities for memory ordering and fencing in C/assembly.
C++ provides a better and easier-to-use interface (and of course a more standardized one, so you can use the same interface across multiple operating systems and machine architectures), but programming lock-free code in C++ won't be simpler than without C++ if you target only one specific OS/machine architecture.

Best approach to synchronising properties across threads

I'm looking for some advice on the best approach to synchronising access to properties of an object in C++. The application has an internal cache of objects which have 10 properties. These objects are to be requested in sets which can then have their properties modified and be re-saved. They can be accessed by 2-4 threads at any given time but access is not intense so my options are:
Lock the property accessors for each object using a critical section. This means lots of critical sections - one for each object.
Return copies of the objects when requested and have an update function which locks a single critical section to update the object properties when appropriate.
I think option 2 seems the most efficient but I just want to see if I'm missing a hidden 3rd option which would be more appropriate.
Thanks,
J
Firstly, I think you are worrying about the wrong thing. How do you know locking or copying causes bottlenecks in your code? Critical sections are fairly lightweight and don't cause much overhead, or, at least, not as much as you think. Just use the most lightweight locking primitive available. If you anticipate your system to be run on multi-processor hardware, you can even use a spinlock.
Secondly, do worry about the simplicity of your concurrency model before performance (hint: a simpler model is easier to understand, to get right and to optimize). So if you can afford it, take copies of objects; this will ease the pain of dealing with TOCTOU race conditions if you are doing complex transformations on the object set that depend on a number of previous values.
You don't want critical sections, you want a mutex.
It's perfectly reasonable to have a single mutex for each object. Lock the mutex before ever reading or writing any of the properties, then unlock it quickly when done.
Mutexes have pretty low overhead when there's no contention. When there's lots of contention, they can definitely slow your program down.
Depending on how long it takes to read a property from an object (I'm guessing it should be fairly trivial, like reading an int or std::string), you could use spin-locks as option #3; they are the fastest way to synchronize threads. And maybe option #4, valid only for ints, is to do no locking at all and use only atomic operations. Perhaps the most efficient solution would be to use atomics for all ints, per-property spin-locks for simple types (PODs and simple objects like std::string) and per-object mutexes/critical sections for anything more complex.
Only a profiler will be able to tell you which is the best option.
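For reference, a spin-lock of the kind mentioned above can be as small as this (a sketch; as noted, only profiling can tell whether it actually beats a mutex for your access pattern):

    #include <atomic>

    class spinlock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
    public:
        void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
        void unlock() { flag.clear(std::memory_order_release); }
    };

    // Because it has lock()/unlock(), it works with std::lock_guard (from <mutex>):
    //     spinlock prop_lock;
    //     { std::lock_guard<spinlock> g(prop_lock); /* read or write the property */ }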