Best approach to synchronising properties across threads - c++

I'm looking for some advice on the best approach to synchronising access to properties of an object in C++. The application has an internal cache of objects which have 10 properties. These objects are to be requested in sets which can then have their properties modified and be re-saved. They can be accessed by 2-4 threads at any given time but access is not intense so my options are:
1. Lock the property accessors for each object using a critical section. This means lots of critical sections - one for each object.
2. Return copies of the objects when requested and have an update function which locks a single critical section to update the object properties when appropriate.
I think option 2 seems the most efficient but I just want to see if I'm missing a hidden 3rd option which would be more appropriate.
Thanks,
J

Firstly, I think you are worrying about the wrong thing. How do you know locking or copying causes bottlenecks in your code? Critical sections are fairly lightweight and don't cause much overhead, or, at least, not as much as you think. Just use the most lightweight locking primitive available. If you anticipate your system to be run on multi-processor hardware, you can even use a spinlock.
Secondly, do worry about simplicity of your concurrency model, before performance (hint: simpler model is easier to understand, to get right and to optimize). So if you can afford it, take copies of objects, this will ease the pain of dealing with TOCTOU race conditions in case you are doing complex transformations on the object set that depend on a number of previous values.
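A minimal sketch of the copy-and-update approach (option 2 in the question), assuming C++11 std::mutex stands in for a Windows critical section. Item, ItemCache and example_usage are hypothetical names, not taken from the question:

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Hypothetical cached object; the real one has about ten properties.
struct Item {
    int id;
    std::string name;
    int quantity;
};

class ItemCache {
public:
    // Returns true and fills 'out' with a private copy if the id is cached.
    bool get(int id, Item& out) const {
        std::lock_guard<std::mutex> lock(mutex_);
        std::map<int, Item>::const_iterator it = items_.find(id);
        if (it == items_.end()) return false;
        out = it->second;
        return true;
    }

    // Stores the caller's (possibly modified) copy back into the cache.
    void update(const Item& item) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_[item.id] = item;
    }

private:
    mutable std::mutex mutex_;   // one lock for the whole cache
    std::map<int, Item> items_;
};

// Usage sketch: read a copy, modify it outside the lock, then save it back.
inline void example_usage() {
    ItemCache cache;
    Item fresh = {1, "widget", 5};
    cache.update(fresh);

    Item copy;
    bool found = cache.get(1, copy);
    assert(found);
    (void)found;
    copy.quantity += 2;          // no lock is held while working on the copy
    cache.update(copy);
}
```

Note that this sketch is last-writer-wins: if two threads copy the same item and both save it back, one update is lost. If that matters, a version number per item that update() checks would be one way to detect it.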

You don't want critical sections, you want a mutex.
It's perfectly reasonable to have a single mutex for each object. Lock the mutex before ever reading or writing any of the properties, then unlock it quickly when done.
Mutexes have pretty low overhead when there's no contention. When there's lots of contention, they can definitely slow your program down.
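As a sketch of the one-mutex-per-object idea, assuming C++11 std::mutex and a hypothetical Widget class with two properties:

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical object whose accessors all take the object's own mutex.
class Widget {
public:
    Widget() : count_(0) {}

    std::string name() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return name_;                 // returned by value, so the copy is made under the lock
    }
    void set_name(const std::string& name) {
        std::lock_guard<std::mutex> lock(mutex_);
        name_ = name;
    }

    int count() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return count_;
    }
    void add(int amount) {
        assert(amount >= 0);
        std::lock_guard<std::mutex> lock(mutex_);
        count_ += amount;             // read-modify-write happens under one lock
    }

private:
    mutable std::mutex mutex_;        // mutable so const getters can lock it
    std::string name_;
    int count_;
};
```

Each accessor is safe on its own, but a sequence such as count() followed by add() is not atomic as a whole. That is the TOCTOU concern raised in the other answer.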

Depending on how long it takes to read a property from an object (I'm guessing it should be fairly trivial, like reading an int or std::string), you could use spin-locks as option #3. They can be the fastest way to synchronize threads when the lock is held only briefly and contention is low. And maybe option #4, valid only for ints, is to do no locking at all and use only atomic operations. Maybe the most efficient solution would be using atomics for all ints, per-property spin-locks for simple types (PODs and simple objects like std::string) and per-object mutexes/CS for anything more complex.
Only a profiler will be able to tell you which is the best option.
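To illustrate the mix described above, here is a sketch assuming C++11: an atomic for an int property and a small test-and-set spin lock for a std::string property. SpinLock and Record are hypothetical names, and a production spin lock would normally add back-off or yielding:

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <string>

// Minimal test-and-set spin lock; has lock()/unlock(), so std::lock_guard works with it.
class SpinLock {
public:
    SpinLock() { flag_.clear(); }
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // busy-wait until the holder clears the flag
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }
private:
    std::atomic_flag flag_;
};

class Record {
public:
    Record() : hits_(0) {}

    // int property: no lock at all, just atomic operations.
    void add_hits(int n) {
        assert(n >= 0);
        hits_.fetch_add(n, std::memory_order_relaxed);
    }
    int hits() const { return hits_.load(std::memory_order_relaxed); }

    // std::string property: guarded by its own spin lock.
    void set_label(const std::string& label) {
        std::lock_guard<SpinLock> guard(label_lock_);
        label_ = label;
    }
    std::string label() const {
        std::lock_guard<SpinLock> guard(label_lock_);
        return label_;
    }

private:
    std::atomic<int> hits_;
    mutable SpinLock label_lock_;
    std::string label_;
};
```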

Related

Does Multiple reader single writer implementation in g++-4.4(Not in C++11/14) via boost::shared_mutex impact performance?

Usage: In our production we have around 100 threads which can access the cache we are trying to implement. If the cache misses, the information will be fetched from the database and the cache will be updated via a writer thread.
To achieve this we are planning to implement multiple readers and a single writer. We cannot update the g++ version since we are using g++-4.4.
Update: Each worker thread can work for both read and write. If cache is missed then information is cached from the DB.
Problem Statement:
We need to implement the cache to enhance the performance.
For this, cache reads are more frequent and write operations to the cache are much rarer.
I think we can use an implementation based on boost::shared_mutex, boost::shared_lock, boost::upgrade_lock and boost::upgrade_to_unique_lock.
But we learnt that boost::shared_mutex has performance issues:
Performance comparison on reader writer locks
Lib boost devel
Questions
Does boost::shared_mutex impact the performance when reads are much more frequent than writes?
What other constructs and design approaches can we take, given the compiler version g++-4.4?
Is there a workaround in the design such that reads are lock free?
Also, we intend to use a map to keep the information for the cache.
If writes were non-existent, one possibility can be 2-level cache where you first have a thread-local cache, and then the normal cache with mutex or reader/writer lock.
If writes are extremely rare, you can do the same. But have some lock-free way of invalidating the thread-local cache, e.g. an atomic int updated with every write, and in those cases clear the thread-local cache.
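A sketch of that two-level idea, written in C++11 for brevity (thread_local and std::atomic are not available in g++-4.4, where Boost or compiler-specific equivalents would presumably be needed). TwoLevelCache and its members are hypothetical names, and the key and value types are placeholders:

```cpp
#include <atomic>
#include <cassert>
#include <map>
#include <mutex>
#include <string>

class TwoLevelCache {
public:
    TwoLevelCache() : generation_(0) {}

    // Rare path: update the shared map, then bump the generation counter.
    void put(int key, const std::string& value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            shared_[key] = value;
        }
        generation_.fetch_add(1, std::memory_order_release);
    }

    // Common path: served from the thread-local map without taking the mutex.
    bool get(int key, std::string& out) {
        Local& local = local_cache();
        unsigned current = generation_.load(std::memory_order_acquire);
        if (local.owner != this || local.generation != current) {
            local.entries.clear();            // a write happened: drop the local copy
            local.owner = this;
            local.generation = current;
        }
        std::map<int, std::string>::const_iterator hit = local.entries.find(key);
        if (hit != local.entries.end()) {
            out = hit->second;
            return true;
        }
        std::lock_guard<std::mutex> lock(mutex_);
        std::map<int, std::string>::const_iterator it = shared_.find(key);
        if (it == shared_.end()) return false;
        out = it->second;
        local.entries[key] = out;             // remember it for next time
        return true;
    }

private:
    struct Local {
        const TwoLevelCache* owner;
        unsigned generation;
        std::map<int, std::string> entries;
        Local() : owner(0), generation(0) {}
    };
    static Local& local_cache() {
        static thread_local Local local;      // one instance per thread
        return local;
    }

    std::mutex mutex_;
    std::map<int, std::string> shared_;
    std::atomic<unsigned> generation_;
};
```

A reader may briefly return a stale value between a writer's map update and the counter increment, so this only fits data where short-lived staleness is acceptable.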
You need to profile it.
In case you're stuck because you don't have a "similar enough" environment where you can actually test things, you can probably write a simple wrapper using pthreads: pthread_rwlock_t
pthread_rwlock_rdlock
pthread_rwlock_wrlock
pthread_rwlock_unlock
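Such a wrapper could look roughly like the sketch below. It is kept C++03-compatible so it should build with g++-4.4. RwLock, ReadGuard, WriteGuard and RwCache are hypothetical names, and error handling of the pthread calls is reduced to a single assert:

```cpp
#include <cassert>
#include <map>
#include <pthread.h>
#include <string>

// Thin wrapper over pthread_rwlock_t.
class RwLock {
public:
    RwLock() {
        int rc = pthread_rwlock_init(&lock_, NULL);
        assert(rc == 0);
        (void)rc;
    }
    ~RwLock() { pthread_rwlock_destroy(&lock_); }
    void lock_shared() { pthread_rwlock_rdlock(&lock_); }
    void lock_exclusive() { pthread_rwlock_wrlock(&lock_); }
    void unlock() { pthread_rwlock_unlock(&lock_); }
private:
    RwLock(const RwLock&);                // non-copyable (C++03 style)
    RwLock& operator=(const RwLock&);
    pthread_rwlock_t lock_;
};

// RAII guards so the lock is released on every exit path.
class ReadGuard {
public:
    explicit ReadGuard(RwLock& lock) : lock_(lock) { lock_.lock_shared(); }
    ~ReadGuard() { lock_.unlock(); }
private:
    ReadGuard(const ReadGuard&);
    ReadGuard& operator=(const ReadGuard&);
    RwLock& lock_;
};

class WriteGuard {
public:
    explicit WriteGuard(RwLock& lock) : lock_(lock) { lock_.lock_exclusive(); }
    ~WriteGuard() { lock_.unlock(); }
private:
    WriteGuard(const WriteGuard&);
    WriteGuard& operator=(const WriteGuard&);
    RwLock& lock_;
};

// Map-based cache: many concurrent readers, one writer at a time.
class RwCache {
public:
    bool find(int key, std::string& out) const {
        ReadGuard guard(lock_);
        std::map<int, std::string>::const_iterator it = map_.find(key);
        if (it == map_.end()) return false;
        out = it->second;
        return true;
    }
    void insert(int key, const std::string& value) {
        WriteGuard guard(lock_);
        map_[key] = value;
    }
private:
    mutable RwLock lock_;
    std::map<int, std::string> map_;
};
```

On a cache miss a worker would call find(), fetch from the database without holding any lock, and then call insert().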
Of course you can design things to be lock free. Most obvious solution would be to not share state. (If you do share state, you'll have to check that your target platform supports atomic instructions). However, without any knowledge of your application domain, I feel very safe suggesting you do not want lock-free. See e.g. Do lock-free algorithms really perform better than their lock-full counterparts?
It all depends on the frequency of the updates, the size of the cache and how much is changed in the update.
Let's assume you have a rather big cache with a lot of changes on each update. Then I would use a read-copy-update pattern, which is lock-free.
If your cached data is pretty small and one time read (e.g. a single integer) RCU is also a good choice.
For a big cache with small updates, or a big cache with updates which are too frequent for RCU, a read-write lock is a good choice.
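The read-copy-update idea can be approximated in user space with an immutable snapshot behind a shared pointer. The sketch below assumes C++11 (boost::shared_ptr and boost::mutex would be the g++-4.4 counterparts) and uses hypothetical names. It is not strictly lock-free: a short mutex protects the pointer copy, whereas a real RCU implementation uses an atomic pointer plus deferred reclamation:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>

class SnapshotCache {
public:
    typedef std::map<int, std::string> Table;

    SnapshotCache() : current_(std::make_shared<const Table>()) {}

    // Readers: the lock is held only long enough to copy the pointer.
    // The table behind the pointer never changes, so it is read without any lock.
    std::shared_ptr<const Table> snapshot() const {
        std::lock_guard<std::mutex> lock(pointer_mutex_);
        return current_;
    }

    // Writers: copy the table, modify the copy, then publish the copy.
    void insert(int key, const std::string& value) {
        std::lock_guard<std::mutex> writer(writer_mutex_);   // one writer at a time
        std::shared_ptr<Table> next = std::make_shared<Table>(*snapshot());
        (*next)[key] = value;
        std::lock_guard<std::mutex> lock(pointer_mutex_);
        current_ = next;    // old snapshot is freed once its last reader lets go
    }

private:
    mutable std::mutex pointer_mutex_;
    std::mutex writer_mutex_;
    std::shared_ptr<const Table> current_;
};
```

Every write copies the whole table, which is why this only pays off when updates are rare or touch most of the data anyway.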
Alongside other answers suggesting you profile it, a large benefit can be had if you can somehow structure or predict the type, order and size of the requests.
If particular types of data are requested in a typical cycle, it would be better to split up the cache per data type. You will improve the cache hit ratio and the size of each cache can be adapted to the type. You will also reduce possible contention.
Likewise, the size of the requests is important when choosing your update approach. Smaller data fragments may be stored longer or even pooled together, while larger chunks may be requested less frequently.
Even with a basic prediction scheme in place that covers only the most frequent fetch patterns, you may already improve performance quite a bit. It's definitely worth it to try and train e.g. a NN (Neural Network) to guess the next request in advance.

Can queries to boost's rtree be done from parallel threads?

I have two modules in application. Module1 owns and builds boost::geometry::index::rtree. Module2 makes queries to Module1, which are passed to RTree. Now I want to speed up and have several Module2 instances, which make queries to one Module1 instance, and work separately. I am 100% sure, that while any Module2 working RTree does not change.
I've found this question: Can I use Boost.Geometry.index.rtree with threads?, but it describes more complicated case, when rtree is modified and queried from different threads. And this answer is ambiguous: "No boost Rtree is not thread-safe in any way" is stated in answer. But in comments it is stated: "It is safe to do queries, and it even possible to create workaround for creation". What is right answer? Are there any resources, except ask direct question to boost authors, to find out?
Tl;dr:
Is it safe to make queries to boost::geometry::index::rtree from different threads, if I am 100% sure, that no thread modifies RTree?
In answer to linked question: "No boost Rtree is not thread-safe in any way". But in comments: "It is safe to do queries, and it even possible to create workaround for creation". Who is right?
There is no contradiction. Adam is the author. Everyone is right. Note that the answer also said
You /can/ run multiple read-only operations in parallel. Usually, library containers are safe to use from multiple threads for read-only operations (although you might want to do a quick scan for any mutable members hidden in the implementation).
In general, as long as the bitwise representation doesn't mutate, everything is safe for concurrent access. This is regardless of library support.
Note that you don't need that "quick scan" as it happens, because of the authoritative comment by Adam Wulkiewicz.
Footnote: that still doesn't make the library thread safe. That is simply true because the memory model of C++ is free of data races with bitwise constant data.
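The same principle can be shown with a plain std::vector standing in for the rtree (an assumption made only to keep the sketch free of Boost). Several threads query one container through const access while nothing modifies it. count_in_range and parallel_queries are hypothetical helpers:

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Read-only "query": counts values in [lo, hi] using const access only.
std::size_t count_in_range(const std::vector<int>& data, int lo, int hi) {
    assert(lo <= hi);
    std::size_t n = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] >= lo && data[i] <= hi) ++n;
    }
    return n;
}

// Runs one query per thread against the same, unmodified container.
std::vector<std::size_t> parallel_queries(
        const std::vector<int>& data,
        const std::vector<std::pair<int, int> >& ranges) {
    std::vector<std::size_t> results(ranges.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < ranges.size(); ++i) {
        workers.emplace_back([&data, &ranges, &results, i] {
            // Each thread writes only its own slot, so results need no lock either.
            results[i] = count_in_range(data, ranges[i].first, ranges[i].second);
        });
    }
    for (std::size_t i = 0; i < workers.size(); ++i) workers[i].join();
    return results;
}
```

The container must be fully built before the threads start, and nothing may modify it until they have all been joined.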
This doesn't seem to be the full question. What I'm reading is in two parts. The first part should be "I want to optimise my program. How should I go about doing that?"
You should use a profiler to take measurements before the optimisation! You might notice in the process that there are more significant optimisations available to you, and those might be pushed out of the window of possibility if you introduce multithreading prematurely.
You should use a profiler to take measurements after the optimisation! It's not uncommon for an optimisation to be found to be insignificant. In terms of multithreading optimisations, from your measurements you should see that processing one task takes slightly longer but that you can process between four and eight at once on a computer that has a four core CPU. If the slightly longer equates to a factor of 4-8x, then obviously multithreading is an unnecessary introduction of bloat and not an optimisation.
The second part, you have provided, in the form of these two statements:
I am 100% sure, that while any Module2 working RTree does not change.
Is it safe to make queries to boost::geometry::index::rtree from different threads, if I am 100% sure, that no thread modifies RTree?
You should use locks. If you don't, you'll be invoking undefined behaviour. I'll explain why you should use locks later.
I would recommend using a read/write lock (e.g. pthread_rwlock_t) for the usecase you have described. This will allow your threads to access the resource simultaneously so long as no threads are attempting to write, and provide a fence for updates to be pushed to threads.
Why should you use locks? First and foremost, they guarantee that your code will function correctly; any concerns regarding whether it's safe become invalid. Secondly, a lock provides a fence at which updates can be pushed to the thread; any concerns regarding the performance implications should be negligible when compared to the amount of gain you should see from this.
You should perform more than one task with each thread! This is why a fence is important. If your threads end up terminating and you end up creating new ones later on, you are incurring an overhead which is of course undesirable when performing an optimisation. If a thread terminates despite more of these tasks foreseen later, then that thread probably should have been suspended instead.
Expect that your optimisation might turn into a work-stealing thread pool. That is the nature of optimisations, when we're targeting the most significant one. Occasionally it is the most significant by far or perhaps the only bottleneck, after all. Optimising such bottlenecks might require extreme measures.
I emphasized "should be negligible" earlier because you're only likely to see a significant improvement in performance up to a point; it should make sense that attempting to fire up 10000 threads (each occupying between 0.5 and 4.0MB stack space for a total of 5-40GB) on a processor that has 4 cores (2500 threads per core) is not going to be very optimal. Nonetheless, this is where many people go wrong, and if they have a profiler guiding them they'll be more likely to notice...
You might even get away with running multiple tasks on one thread, if your tasks involve IO that can be made non-blocking. That's usually an optimisation I'll look into before I look at multithreading, as the profiler will highlight.

C++ threading vs. visibility issues - what's the common engineering practice?

From my studies I know the concepts of starvation, deadlock, fairness and other concurrency issues. However, theory differs from practice, to an extent, and real engineering tasks often involve greater detail than academic blah blah...
As a C++ developer I've been concerned about threading issues for a while...
Suppose you have a shared variable x which refers to some larger portion of the program's memory. The variable is shared between two threads A and B.
Now, if we consider read/write operations on x from both A and B threads, possibly at the same time, there is a need to synchronize those operations, right? So the access to x needs some form of synchronization which can be achieved for example by using mutexes.
Now let's consider another scenario where x is initially written by thread A, then passed to thread B (somehow) and that thread only reads x. The thread B then produces a response to x called y and passes it back to the thread A (again, somehow). My question is: what synchronization primitives should I use to make this scenario thread-safe? I've read about atomics and, more importantly, memory fences - are these the tools I should rely on?
This is not a typical scenario in which there is a "critical section". Instead some data is passed between threads with no possibility of concurrent writes in the same memory location. So, after being written, the data should first be "flushed" somehow, so that the other threads could see it in a valid and consistent state before reading. What is it called in the literature? Is it "visibility"?
What about pthread_once and its Boost/std counterpart i.e. call_once. Does it help if both x and y are passed between threads through a sort of "message queue" which is accessed by means of "once" functionality. AFAIK it serves as a sort of memory fence but I couldn't find any confirmation for this.
What about CPU caches and their coherency? What should I know about that from the engineering point of view? Does such knowledge help in the scenario mentioned above, or any other scenario commonly encountered in C++ development?
I know I might be mixing a lot of topics but I'd like to better understand what is the common engineering practice so that I could reuse the already known patterns.
This question is primarily related to the situation in C++03 as this is my daily environment at work. Since my project mainly involves Linux then I may only use pthreads and Boost, including Boost.Atomic. But I'm also interested if anything concerning such matters has changed with the advent of C++11.
I know the question is abstract and not that precise but any input could be useful.
you have a shared variable x
That's where you've gone wrong. Threading is MUCH easier if you hand off ownership of work items using some sort of threadsafe consumer-producer queue, and from the perspective of the rest of the program, including all the business logic, nothing is shared.
Message passing also helps prevent cache collisions: there is no true sharing except of the producer-consumer queue itself, which has a trivial effect on performance if the unit of work is large, and organizing the data into messages helps reduce false sharing.
Parallelism scales best when you separate the problem into subproblems. Small subproblems are also much easier to reason about.
You seem to already be thinking along these lines, but no, threading primitives like atomics, mutexes, and fences are not very good for applications using message passing. Find a real queue implementation (queue, circular ring, Disruptor, they go under different names but all meet the same need). The primitives will be used inside the queue implementation, but never by application code.
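For illustration only, a minimal blocking producer-consumer queue could look like the sketch below. It assumes C++11 (boost::mutex and boost::condition_variable would be the C++03 counterparts), and WorkQueue is a hypothetical name. The mutex inside the queue is what makes everything written to x before push() visible to the thread that receives it from pop():

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <utility>

template <typename T>
class WorkQueue {
public:
    // Producer side: hand ownership of the item to the queue.
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(std::move(item));
        }
        ready_.notify_one();              // wake one waiting consumer
    }

    // Consumer side: blocks until an item is available, then takes ownership.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop_front();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<T> items_;
};
```

In the scenario from the question, thread A would push x onto one queue and thread B would push y onto a second queue going the other way. Neither thread touches an item after handing it over.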

Lock-free data structures in C++ = just use atomics and memory-ordering?

I used to see the term "lock free data structure" and think "ooooo that must be really complex". However, I have been reading "C++ Concurrency in Action" and it seems to write a lock-free data structure all you do is stop using mutexes/locks and replace them with atomic code (along with possible memory-ordering barriers).
So my question is- am I missing something here? Is it really that much simpler due to C++11? Is writing a lock-free data structure just a case of replacing the locks with atomic operations?
Ooooo but that is really complex.
If you don't see the difference between a mutex and an atomic access, there is something wrong with the way you look at parallel processing, and there will soon be something wrong with the code you write.
Most likely it will run slower than the equivalent blocking version, and if you (or rather your coworkers) are really unlucky, it will spout the occasional inconsistent data and crash randomly.
Even more likely, it will propagate real-time constraints to large parts of your application, forcing your coworkers to waste a sizeable amount of their time coping with arbitrary requirements they and their software would have quite happily lived without, and resort to various superstitions ("good practices") to obfuscate their code into submission.
Oh well, as long as the template guys and the wait-free guys had their little fun...
Parallel processing, be it blocking or supposedly wait-free, is inherently resource consuming, complex and costly to implement. Designing a software architecture that takes real advantage of non-trivial parallel processing is a job for specialists.
A good software design should on the contrary limit the parallelism to the bare minimum, leaving most of the programmers free to implement linear, sequential code.
As for C++, I find this whole philosophy of wrapping indifferently a string, a thread and a coffee machine in the same syntactic goo a disastrous design choice.
C++ is allowing you to create a multiprocessor synchronization object out of about anything, like you would allocate a mere string, which is akin to presenting an assault rifle next to a squirt gun in the same display case.
No doubt a lot of people are making a living by selling the idea that an assault rifle and a squirt gun are, after all, not so different. But still, they are.
Two things to consider:
Only a single operation is atomic when using a C++11 atomic, but you often want a mutex to protect a larger region of code.
If you use std::atomic with a type that the compiler cannot map to atomic machine instructions, the implementation will have to fall back on a lock for that operation.
Overall you probably want to stick with using mutexes and only use lock-free code for performance critical sections, or if you were implementing your own structures to use for synchronization.
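The first point can be sketched as follows, assuming C++11 and two hypothetical classes. A lone counter needs only one atomic operation, while two values that must change together need a mutex around the whole update:

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// One independent value: a single atomic operation is enough.
class HitCounter {
public:
    HitCounter() : hits_(0) {}
    void record() { hits_.fetch_add(1, std::memory_order_relaxed); }
    long total() const { return hits_.load(std::memory_order_relaxed); }
private:
    std::atomic<long> hits_;
};

// Two values tied by an invariant (low <= high): making each one atomic
// separately would still let a reader see a new 'low' with an old 'high'.
class Range {
public:
    Range() : low_(0), high_(0) {}
    void set(int low, int high) {
        assert(low <= high);
        std::lock_guard<std::mutex> lock(mutex_);
        low_ = low;
        high_ = high;
    }
    int width() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return high_ - low_;
    }
private:
    mutable std::mutex mutex_;
    int low_;
    int high_;
};
```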
You are missing something. While lock free data structures do use the primitives you mention, simply invoking their existence will not provide you with a lock free queue.
Lock-free code is not simpler because of C++. Without C++, operating systems and compilers often provide similar facilities for memory ordering and fencing in C or assembly.
C++ provides a better and easier to use interface (and of course a more standardized one, so you can use the same interface on multiple OSes and machine architectures), but programming lock-free code in C++ won't be simpler than without C++ if you target only one specific type of OS/machine architecture.

Alternatives for locks for synchronisation

I'm currently in the process of developing my own little threading library, mainly for learning purposes, and am at the part of the message queue which will involve a lot of synchronisation in various places. Previously I've mainly used locks, mutexes and condition variables a bit which all are variations of the same theme, a lock for a section that should only be used by one thread at a time.
Are there any different solutions to synchronisation than using locks? I've read about lock-free synchronization in places, but some consider hiding the locks in containers to be lock-free, which I disagree with; you just don't explicitly use the locks yourself.
Lock-free algorithms typically involve using compare-and-swap (CAS) or similar CPU instructions that update some value in memory not only atomically, but also conditionally and with an indicator of success. That way you can code something like this:
1 do
2 {
3 current_value = the_variable
4 new_value = ...some expression using current_value...
5 } while(!compare_and_swap(the_variable, current_value, new_value));
compare_and_swap() atomically checks whether the_variable's value is still current_value, and only if that's so will it update the_variable's value to new_value and return true
exact calling syntax will vary with the CPU, and may involve assembly language or system/compiler-provided wrapper functions (use the latter if available - there may be other compiler optimisations or issues that their usage restricts to safe behaviours); generally, check your docs
The significance is that when another thread updates the variable after the read on line 3 but before the CAS on line 5 attempts the update, the compare and swap instruction will fail because the state from which you're updating is not the one you used to calculate the desired target state. Such do/while loops can be said to "spin" rather than lock, as they go round and round the loop until CAS succeeds.
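In C++11 the same loop can be written with std::atomic. The sketch below uses a hypothetical function that doubles a value up to a cap, chosen only so there is "some expression using current_value". Note that compare_exchange_weak writes the freshly observed value back into current_value when it fails, so the re-read from line 3 of the pseudocode is implicit:

```cpp
#include <atomic>
#include <cassert>

// Atomically replaces the_variable with min(2 * value, cap) and returns the stored value.
// Assumes values are small enough that doubling does not overflow.
int atomic_double_capped(std::atomic<int>& the_variable, int cap) {
    assert(cap >= 0);
    int current_value = the_variable.load();
    int new_value;
    do {
        // Computed speculatively; redone if another thread got in first.
        new_value = (current_value * 2 > cap) ? cap : current_value * 2;
    } while (!the_variable.compare_exchange_weak(current_value, new_value));
    return new_value;
}
```

compare_exchange_weak may also fail spuriously on some platforms, which is harmless here because the loop simply retries.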
Crucially, your existing threading library can be expected to have a two-stage locking approach for mutex, read-write locks etc. involving:
First stage: spinning using CAS or similar (i.e. spin on { read the current value, if it's not set then cas(current = not set, new = set) }) - which means other threads doing a quick update often won't result in your thread swapping out to wait, and all the relatively time-consuming overheads associated with that.
The second stage is only used if some limit of loop iterations or elapsed time is exceeded: it asks the operating system to queue the thread until it knows (or at least suspects) the lock is free to acquire.
The implication of this is that if you're using a mutex to protect access to a variable, then you are unlikely to do any better by implementing your own CAS-based "mutex" to protect the same variable.
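Purely to illustrate the two stages (and not as something to prefer over the library's own mutex, for the reason just given), a sketch might spin on try_lock a bounded number of times and then fall back to a blocking lock. SpinThenBlockMutex and the default limit of 100 are hypothetical. Real implementations typically spin on a CAS and use an OS facility such as a futex for the second stage:

```cpp
#include <cassert>
#include <mutex>

class SpinThenBlockMutex {
public:
    explicit SpinThenBlockMutex(int spin_limit = 100) : spin_limit_(spin_limit) {
        assert(spin_limit >= 0);
    }

    void lock() {
        // First stage: a bounded number of cheap attempts.
        for (int i = 0; i < spin_limit_; ++i) {
            if (inner_.try_lock()) return;
        }
        // Second stage: let the OS queue this thread until the lock is free.
        inner_.lock();
    }

    void unlock() { inner_.unlock(); }

private:
    std::mutex inner_;
    int spin_limit_;
};
```

Because it provides lock() and unlock(), the class can be used with std::lock_guard.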
Lock free algorithms come into their own when you are working directly on a variable that's small enough to update directly with the CAS instruction itself. Instead of being...
get a mutex (by spinning on CAS, falling back on slower OS queue)
update variable
release mutex
...they're simplified (and made faster) by simply having the spin on CAS do the variable update directly. Of course, you may find the work to calculate new value from old painful to repeat speculatively, but unless there's a LOT of contention you're not wasting that effort often.
This ability to update only a single location in memory has far-reaching implications, and work-arounds can require some creativity. For example, if you had a container using lock-free algorithms, you may decide to calculate a potential change to an element in the container, but can't sync that with updating a size variable elsewhere in memory. You may need to live without size, or be able to use an approximate size where you do a CAS-spin to increment or decrement the size later, but any given read of size may be slightly wrong. You may need to merge two logically-related data structures - such as a free list and the element-container - to share an index, then bit-pack the core fields for each into the same atomically-sized word at the start of each record. These kinds of data optimisations can be very invasive, and sometimes won't get you the behavioural characteristics you'd like. Mutexes et al are much easier in this regard, and at least you know you won't need a rewrite to mutexes if requirements evolve just that step too far. That said, clever use of a lock-free approach really can be adequate for a lot of needs, and yield a very gratifying performance and scalability improvement.
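The bit-packing trick mentioned above can be sketched like this, assuming a platform where 64-bit atomics are lock-free. PackedHead is a hypothetical name and the 32/32 split between an index and an update count is an arbitrary choice for illustration:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Two logically related fields share one 64-bit word so a single CAS updates both.
class PackedHead {
public:
    PackedHead() : word_(0) {}

    static std::uint64_t pack(std::uint32_t index, std::uint32_t count) {
        return (static_cast<std::uint64_t>(index) << 32) | count;
    }
    static std::uint32_t index_of(std::uint64_t word) {
        return static_cast<std::uint32_t>(word >> 32);
    }
    static std::uint32_t count_of(std::uint64_t word) {
        return static_cast<std::uint32_t>(word & 0xffffffffu);
    }

    // Stores a new index and increments the count in one atomic step.
    void publish(std::uint32_t new_index) {
        std::uint64_t current = word_.load();
        std::uint64_t next;
        do {
            next = pack(new_index, count_of(current) + 1);
        } while (!word_.compare_exchange_weak(current, next));
    }

    std::uint64_t load() const { return word_.load(); }

private:
    std::atomic<std::uint64_t> word_;
};
```

A reader that calls load() once always sees an index and a count that belong together, which would not be guaranteed with two separate atomics.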
A core (good) consequence of lock-free algorithms is that one thread can't be holding the mutex then happen to get swapped out by the scheduler, such that other threads can't work until it resumes; rather - with CAS - they can spin safely and efficiently without an OS fallback option.
Things that lock free algorithms can be good for include updating usage/reference counters, modifying pointers to cleanly switch the pointed-to data, free lists, linked lists, marking hash-table buckets used/unused, and load-balancing. Many others of course.
As you say, simply hiding use of mutexes behind some API is not lock free.
There are a lot of different approaches to synchronization. There are various variants of message-passing (for example, CSP) or transactional memory.
Both of these may be implemented using locks, but that's an implementation detail.
And then of course, for some purposes, there are lock-free algorithms or data-structures, which make do with just a few atomic instructions (such as compare-and-swap), but this isn't really a general-purpose replacement for locks.
Some data structures have lock-free implementations. For example, the producer/consumer pattern can often be implemented using lock-free linked list structures.
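As a simpler stand-in for a lock-free linked list, the sketch below shows a fixed-size ring buffer for exactly one producer thread and one consumer thread, assuming C++11 atomics. SpscRing is a hypothetical name, T must be default-constructible and copyable, and the structure is not safe with more than one thread on either side:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

template <typename T, std::size_t Capacity>
class SpscRing {
public:
    SpscRing() : head_(0), tail_(0) {}

    // Producer thread only. Returns false if the ring is full.
    bool try_push(const T& value) {
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        std::size_t next = (tail + 1) % kSlots;
        if (next == head_.load(std::memory_order_acquire)) return false;
        slots_[tail] = value;
        tail_.store(next, std::memory_order_release);   // publish the slot
        return true;
    }

    // Consumer thread only. Returns false if the ring is empty.
    bool try_pop(T& out) {
        std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;
        out = slots_[head];
        head_.store((head + 1) % kSlots, std::memory_order_release);  // free the slot
        return true;
    }

private:
    // One slot always stays empty so that "full" and "empty" can be told apart.
    static const std::size_t kSlots = Capacity + 1;
    T slots_[Capacity + 1];
    std::atomic<std::size_t> head_;   // next slot to read, written by the consumer
    std::atomic<std::size_t> tail_;   // next slot to write, written by the producer
};
```

The release store on tail_ paired with the acquire load in try_pop is what makes the item's contents visible to the consumer before it reads the slot.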
However, most lock-free solutions require significant thought on the part of the person designing the specific program/specific problem domain. They aren't generally applicable for all problems. For examples of such implementations, take a look at Intel's Threading Building Blocks library.
Most important to note is that no lock-free solution is free. You're going to give something up to make that work, at the bare minimum in implementation complexity, and probably performance in scenarios where you're running on a single core (for example, a linked list is MUCH slower than a vector). Make sure you benchmark before using lock free on the base assumption that it would be faster.
Side note: I really hope you're not using condition variables, because there's no way to ensure that their access operates as you wish in C and C++.
Yet another library to add to your reading list: Fast Flow
What's interesting in your case is that they are based on lock-free queues. They have implemented a simple lock-free queue and then have built more complex queues out of it.
And since the code is free, you can peruse it and get the code for the lock-free queue, which is far from trivial to get right.