Is it possible to implement a lock-free map in C++?

We are developing a client/server network application, and we have found that so many locks are being taken around a std::map that the server's performance has become poor.
I wonder whether it is possible to implement a lock-free map, and if so, how? Is there any open-source code out there?
EDIT:
Actually we use the std::map to store socket information: we wrapped the socket file descriptor together with other necessary information such as the IP address, port, and socket type (TCP or UDP).
To summarize, we have a global map, say
map<int fileDescriptor, socketInfo*> SocketsMap,
and every thread that sends data needs to access SocketsMap, acquiring a mutex before reading from or writing to it. With so much locking on SocketsMap, the concurrency level of the whole application is greatly reduced.
To avoid this problem, we have two solutions: 1. store each socketInfo* separately; 2. use some kind of lock-free map.
I would like to find some kind of lock-free map, because the code changes required by this solution are much smaller than those of solution 1.

Actually there is a way, although I haven't implemented it myself: there is a paper on a lock-free map using hazard pointers from the eminent C++ expert Andrei Alexandrescu.

Yes, I have implemented a Lock-Free Unordered Map (docs) in C++ using the "Split-Ordered Lists" concept. It's an auto-expanding container and supports millions of elements on a 64-bit CAS without ABA issues. Performance-wise, it's a beast (see page 5). It's been extensively tested with millions of random ops.

Would a hash map suit? Have a look at Intel Threading Building Blocks (TBB); it has an interesting concurrent map. I'm not sure it's lock-free, but hopefully you're interested in good multithreaded performance rather than in lock-freeness as such. You could also check the CityHash library.
EDIT:
Actually TBB's hash map is not lock-free

I'm surprised nobody has mentioned it, but Cliff Click has implemented a lock-free hash map in Java, which I believe could be ported to C++.

If you use C++11, you can have a look at the AtomicHashMap from facebook/folly.

You can implement the map using an optimistic design or transactional memory.
This approach is especially effective if the chance of two operations concurrently addressing the map, with one of them changing its structure, is relatively small, and you do not want the overhead of locking every time.
However, from time to time a collision will occur, and you will have to resolve it somehow (usually by rolling back to the last stable state and retrying the operation).
If your hardware supports good enough atomic operations, this can easily be done with compare-and-swap (CAS), where you swap only the reference: whenever you change the map, you work on a copy rather than the original, and set the copy as the primary only when you commit.
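A minimal sketch of that copy-and-swap idea, assuming C++11 and the free atomic functions for std::shared_ptr (the map type and function names are only illustrative; note that these shared_ptr atomics are themselves allowed to use an internal lock on some implementations, so this is optimistic rather than strictly lock-free):

#include <map>
#include <memory>
#include <string>

typedef std::map<int, std::string> Map;

// Readers take a snapshot; writers copy, modify the copy, and try to publish it.
std::shared_ptr<const Map> g_map = std::make_shared<const Map>();

std::shared_ptr<const Map> snapshot()
{
    return std::atomic_load(&g_map);              // readers never block on writers
}

void insert(int key, const std::string& value)
{
    std::shared_ptr<const Map> cur = std::atomic_load(&g_map);
    for (;;)
    {
        auto next = std::make_shared<Map>(*cur);  // work on a private copy
        (*next)[key] = value;
        // Commit only if nobody else published a new map since we took 'cur';
        // on failure 'cur' is refreshed, and we rebuild the copy and retry.
        if (std::atomic_compare_exchange_weak(&g_map, &cur,
                std::shared_ptr<const Map>(std::move(next))))
            return;
    }
}

Copying the whole map per update is only reasonable when writes are rare compared to reads, which is exactly the situation the optimistic approach targets.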


When is std::shared_timed_mutex slower than std::mutex and when (not) to use it?

I am trying to implement a multithreaded LRU cache in C++ using this article as a hint or inspiration. It is for Go, but the concepts required more or less exist in C++ too. This article proposes to use fine-grained locking with shared mutexes around a hash table and a linked list.
So I intended to write a cache using std::unordered_map, std::list and locking with std::shared_timed_mutex. My use case includes several threads (4-8) using this cache as a storage for misspelled words and corresponding possible corrections. The size of the cache would be around 10000-100000 items.
But I have read in several places that it rarely makes sense to use a shared mutex instead of a plain one and that it is slower, though I couldn't find any real benchmarks with numbers, or at least vague guidelines on when to use one and when not to. Other sources, meanwhile, propose using a shared mutex any time concurrent readers more or less outnumber concurrent writers.
When is it better to use an std::shared_timed_mutex than a plain std::mutex? How many times should readers/reads outnumber writers/writes? Of course I get that it depends on many factors, but how should I make a decision which one to use?
Maybe it's platform-dependent and some platform implementations are worse than others? (we use Linux and Windows as targets, MSVC 2017 and GCC 5)
Does it make sense to implement cache locking as described in the article?
Does std::shared_mutex (from C++17) make any difference in performance compared to a timed one?
P.S. I suspect the answer will be "measure/profile first what fits your case best". I would, but I need to implement one option first, and it would be great if there were some heuristics to choose by instead of implementing both options and measuring. Also, even if I measure, I think the outcome will depend on the data I use, and real data can be hard to predict (e.g. for a server in a cloud).
When is it better to use an std::shared_timed_mutex than a plain std::mutex?
How many times should readers/reads outnumber writers/writes? Of course I get that it depends on many factors, but how should I make a decision which one to use?
Because of their extra complexity, cases where read/writer locks (std::shared_mutex, std::shared_timed_mutex) are superior to plain locks (std::mutex, std::timed_mutex) are rare. They do exist, but personally I have never encountered one myself.
A read/writer mutex will not improve performance if you have frequent but short read operations. It is better suited for scenarios where read operations are frequent and expensive. When the read operation is only a lookup in an in-memory data structure, a simple lock will most likely outperform the read/writer solution.
If the read operations are very costly and you can process many in parallel, increasing the read vs write ratio should at some point lead to a situation where read/writer will outperform an exclusive lock. Where that breaking point is depends on the real workload. I am not aware of a good rule of thumb.
Also note that performing expensive operations while holding a lock is often a bad sign. There may be better ways to solve the problem than using a read/writer lock.
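For reference, the pattern under discussion looks roughly like this (a sketch using the C++14 types the question mentions; whether it actually beats a plain std::mutex is exactly the open question):

#include <map>
#include <shared_mutex>
#include <string>

class SharedMap
{
public:
    std::string find(int key) const
    {
        std::shared_lock<std::shared_timed_mutex> lock(mutex_);  // many concurrent readers
        auto it = map_.find(key);
        return it == map_.end() ? std::string() : it->second;
    }

    void insert(int key, std::string value)
    {
        std::unique_lock<std::shared_timed_mutex> lock(mutex_);  // writers are exclusive
        map_[key] = std::move(value);
    }

private:
    mutable std::shared_timed_mutex mutex_;
    std::map<int, std::string> map_;
};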
Two comments on the topic from people that have far more knowledge in that field than myself:
Howard Hinnant's answer to "C++14 shared_timed_mutex VS C++11 mutex"
Anthony Williams' quote can be found at the end of this answer (unfortunately, the link to this original post seems to be dead). He explains why read/writer locks are slow, and are often not the ideal solution.
Maybe it's platform-dependent and some platform implementations are worse than others? (we use Linux and Windows as targets, MSVC 2017 and GCC 5)
I am not aware of significant differences between operating systems. My expectation would be that the situation will be similar. On Linux, the GCC library relies on the read/writer lock implementation of glibc. If you want to dig in, you can find the implementation in pthread_rwlock_common.c. It also illustrates the extra complexity that comes with read/writer locks.
There is an old issue for the shared_mutex implementation in Boost (#11798 - Implementation of boost::shared_mutex on POSIX is suboptimal). But it is not clear to me if the implementation can be improved, or if it is only an example that is not well suited for read/writer locks.
Does it make sense to implement cache locking as described in the article?
Frankly, I am skeptical that a read/writer lock will improve performance in such a data structure. The reader operations should be extremely fast, as it is only a lookup. Updating the LRU list also happens outside the read operations (in the Go implementation).
One implementation detail: using a linked list is not a bad idea here, because it makes the update operations extremely fast (you just update pointers). When using std::list, keep in mind that it normally involves memory allocations, which you should avoid while holding a lock. It is better to allocate the memory before you acquire the lock, as memory allocations are expensive.
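For comparison, the plain exclusive-lock variant that such a read/writer design would have to beat is roughly the following (a sketch; it still allocates inside put() while the lock is held, which the previous paragraph advises against, and capacity handling is simplified):

#include <cstddef>
#include <list>
#include <mutex>
#include <string>
#include <unordered_map>

class LruCache
{
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    bool get(const std::string& key, std::string& value)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = index_.find(key);
        if (it == index_.end())
            return false;
        // Move the entry to the front of the LRU list: pointer updates only, no allocation.
        order_.splice(order_.begin(), order_, it->second);
        value = it->second->second;
        return true;
    }

    void put(const std::string& key, std::string value)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = index_.find(key);
        if (it != index_.end())
        {
            it->second->second = std::move(value);
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (index_.size() >= capacity_)            // evict the least recently used entry
        {
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(key, std::move(value));
        index_[key] = order_.begin();
    }

private:
    typedef std::pair<std::string, std::string> Entry;
    std::mutex mutex_;
    std::size_t capacity_;
    std::list<Entry> order_;                       // most recent entry at the front
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};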
In their HHVM project, Facebook has C++ implementations of concurrent LRU caches that look promising:
ConcurrentLRUCache
ConcurrentScalableCache
The ConcurrentLRUCache also uses a linked list (but not std::list) for the LRU list, and tbb::concurrent_hash_map for the map itself (a concurrent hash map implementation from Intel). Note that for locking of the LRU list updates, they did not go for the read/writer approach as in the Go implementation but use a simple std::mutex exclusive lock.
The second implementation (ConcurrentScalableCache) builds on top of ConcurrentLRUCache. They use sharding to improve scalability. The drawback is that the LRU property is only approximated (depending on how many shards you use). In some workloads that might reduce the cache hit rate, but it is a nice trick to avoid having all operations share the same lock.
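The sharding trick itself is independent of the cache and is easy to sketch (a hedged example; the shard count and types are arbitrary):

#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

class ShardedMap
{
public:
    void put(const std::string& key, std::string value)
    {
        Shard& s = shard_for(key);
        std::lock_guard<std::mutex> lock(s.mutex);
        s.data[key] = std::move(value);
    }

    bool get(const std::string& key, std::string& value)
    {
        Shard& s = shard_for(key);
        std::lock_guard<std::mutex> lock(s.mutex);
        auto it = s.data.find(key);
        if (it == s.data.end())
            return false;
        value = it->second;
        return true;
    }

private:
    static const std::size_t kShards = 8;

    struct Shard
    {
        std::mutex mutex;
        std::unordered_map<std::string, std::string> data;
    };

    // Keys hash to one of the shards, so threads touching different shards never contend.
    Shard& shard_for(const std::string& key)
    {
        return shards_[std::hash<std::string>()(key) % kShards];
    }

    std::array<Shard, kShards> shards_;
};

Because any LRU bookkeeping would live inside each shard, eviction decisions are made per shard, which is why the global LRU property is only approximate.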
Does std::shared_mutex (from C++17) make any difference in performance compared to a timed one?
I do not have benchmark numbers about the overhead, but it looks like comparing apples and oranges. If you need the timing feature, you have no real choice but to use std::shared_timed_mutex. But if you do not need it, you can simply use std::shared_mutex, which has to do less work and thus should never be slower.
I would not expect the timing overhead to be too serious in typical scenarios where you need timeouts, as the locks tend to be held longer in those cases anyway. But as said, I cannot back that statement with real measurements.
So, which problems can actually be solved by std::shared_mutex?
Let's imagine you are writing some real-time audio software. You have a callback which is called by the driver 1000 times per second, and you have to put 1 ms of audio data into its buffer so the hardware can play it during the next 1 ms. You also have a "big" buffer of audio data (say 10 seconds) which is rendered by another thread in the background and rewritten once every 10 seconds. Additionally, you have 10 more threads which want to read from the same buffer (to draw something on the UI, send data over the network, control external lights, and so on). These are real tasks of real DJ software, not a joke.
So, on every callback (every 1 ms) you have a very low chance of conflicting with the writer thread (0.01%), but a nearly 100% chance of conflicting with another reader thread - they work all the time and read the same buffer! Now suppose one of the threads reading from this buffer locks a std::mutex, decides to send something over the network, and then waits 500 ms for the response - you are locked out, can't do anything, your hardware will not get its next portion of sound, and it will play silence (imagine this during a concert, for example). This is a disaster.
But here is the solution - use std::shared_mutex and std::shared_lock for all reader threads. Yes, the average lock via std::shared_lock will cost you more (say 100 nanoseconds instead of 50 - still very cheap even for a real-time app that must write its buffer within 1 ms), but you are 100% safe from the worst case where another reader thread blocks your performance-critical thread for 500 ms.
And that is the reason to use std::shared_mutex - to avoid/improve worst cases, not to improve average performance (that should be achieved in other ways).

Lock-free data structures in C++ = just use atomics and memory-ordering?

I used to see the term "lock-free data structure" and think "ooooo, that must be really complex". However, I have been reading "C++ Concurrency in Action", and it seems that to write a lock-free data structure all you do is stop using mutexes/locks and replace them with atomic code (along with possible memory-ordering barriers).
So my question is- am I missing something here? Is it really that much simpler due to C++11? Is writing a lock-free data structure just a case of replacing the locks with atomic operations?
Ooooo but that is really complex.
If you don't see the difference between a mutex and an atomic access, there is something wrong with the way you look at parallel processing, and there will soon be something wrong with the code you write.
Most likely it will run slower than the equivalent blocking version, and if you (or rather your coworkers) are really unlucky, it will spout the occasional inconsistent data and crash randomly.
Even more likely, it will propagate real-time constraints to large parts of your application, forcing your coworkers to waste a sizeable amount of their time coping with arbitrary requirements they and their software would have quite happily lived without, and to resort to various superstitious "good practices" to obfuscate their code into submission.
Oh well, as long as the template guys and the wait-free guys had their little fun...
Parallel processing, be it blocking or supposedly wait-free, is inherently resource-consuming, complex, and costly to implement. Designing a software architecture that takes real advantage of non-trivial parallel processing is a job for specialists.
A good software design should on the contrary limit the parallelism to the bare minimum, leaving most of the programmers free to implement linear, sequential code.
As for C++, I find this whole philosophy of wrapping indifferently a string, a thread and a coffee machine in the same syntactic goo a disastrous design choice.
C++ is allowing you to create a multiprocessor synchronization object out of about anything, like you would allocate a mere string, which is akin to presenting an assault rifle next to a squirt gun in the same display case.
No doubt a lot of people are making a living by selling the idea that an assault rifle and a squirt gun are, after all, not so different. But still, they are.
Two things to consider:
Only a single operation is atomic when using C++11 atomics, but you often want a mutex to protect a larger region of code.
If you use std::atomic with a type that the compiler cannot map to an atomic operation in machine code, the implementation will fall back to a lock for that operation (you can check this with std::atomic<T>::is_lock_free()).
Overall you probably want to stick with using mutexes and only use lock-free code for performance critical sections, or if you were implementing your own structures to use for synchronization.
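To illustrate the second point above, here is a small check you can compile and run (the struct is just an arbitrary example of a type too large for a single atomic instruction on most targets; on GCC/Clang you may need to link with -latomic):

#include <atomic>
#include <cstdio>

struct Big { char data[64]; };

int main()
{
    std::atomic<int> small_value(0);
    std::atomic<Big> big_value;   // only is_lock_free() is queried, the value is never read

    // Typically 1: an int fits in a machine word, so a real atomic instruction is used.
    std::printf("atomic<int> lock-free: %d\n", (int)small_value.is_lock_free());
    // Typically 0: the implementation falls back to an internal lock.
    std::printf("atomic<Big> lock-free: %d\n", (int)big_value.is_lock_free());
}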
You are missing something. While lock-free data structures do use the primitives you mention, simply invoking their existence will not provide you with a lock-free queue.
Lock-free code is not simpler because of C++; without C++, operating systems often provide similar facilities for memory ordering and fencing in C/assembly.
C++ provides a better and easier-to-use interface (and of course a more standardized one, so you can use the same interface across operating systems and machine architectures), but programming lock-free code in C++ won't be simpler than doing it without C++ if you target only one specific OS/machine architecture.

(preferably boost) lock-free array/vector/map/etc?

Considering my lack of C++ knowledge, please try to read my intent and not my poor technical question.
This is the backbone of my program https://github.com/zaphoyd/websocketpp/blob/experimental/examples/broadcast_server/broadcast_server.cpp
I'm building a websocket server with websocket++ (and oh, is websocket++ sweet; I highly recommend it). I can easily manipulate per-user data thread-safely because it doesn't really need to be touched by different threads; however, I do want to be able to write to an array (I'm going to use the catch-all term "array" from weaker languages like VB, PHP and JS) in one function thread (with multiple iterations that could be running simultaneously) and also read from it in one or more threads.
Take this site as an example: if I wanted all of the ids (the PRIMARY column of all articles) sorted in a particular way, in this case by net votes, and held in memory, I'm thinking I would have a function running in its own boost::thread, fired whenever a vote comes in, to reorder the array.
How can I do this without locking and blocking? I'm 100% fine with users reading from an old array while another is being built, but I absolutely do not want their reads or the writing thread to ever fail or be blocked.
Does a lock-free array exist? If not, is there some way to build the new array in a temporary array and then write it to the actual array when the building is finished without locking & blocking?
Have you looked at Boost.Lockfree?
Uh, uh, uh. Complicated.
Look here (for an example): RCU -- and this is only about multiple reads along with ONE write.
My guess is that multiple writers at once are not going to work. You should rather look for a more efficient representation than an array, one that allows for faster updates. How about a balanced tree? log(n) should never block anything in a noticeable fashion.
Regarding boost -- I'm happy that it finally has proper support for thread synchronization.
Of course, you could also keep a copy and batch the updates. Then a background process merges the updates and copies the result for the readers.
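A minimal sketch of that "build a new copy, then publish it" approach, assuming C++11 and the free std::atomic_load/std::atomic_store overloads for std::shared_ptr (names are illustrative; these overloads may use an internal lock on some implementations, but readers never wait for the slow rebuild):

#include <algorithm>
#include <memory>
#include <vector>

// Readers always see a complete, immutable snapshot; the single writer thread
// builds a new vector off to the side and publishes it with one pointer store.
std::shared_ptr<const std::vector<int>> g_ids = std::make_shared<const std::vector<int>>();

std::shared_ptr<const std::vector<int>> read_snapshot()
{
    return std::atomic_load(&g_ids);
}

void rebuild(std::vector<int> ids_by_votes)      // called from the writer thread
{
    std::sort(ids_by_votes.begin(), ids_by_votes.end());
    auto fresh = std::make_shared<const std::vector<int>>(std::move(ids_by_votes));
    std::atomic_store(&g_ids, fresh);            // readers pick this up on their next load
}

If more than one thread can rebuild at the same time, the last writer wins; avoiding lost updates then needs a compare-and-swap loop on the publish step rather than a plain store.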

Synchronized unordered_map in C++

I am using unordered_map from Boost. Is there any synchronized version of unordered_map? I ask because I have quite a large number of unordered_maps, and manually synchronizing them with locks would be very messy.
Thanks.
It's impossible to usefully encapsulate containers offering STL-like interfaces (which unordered_map also offers) with automatic locking, because there are race conditions associated with retrieving iterators and positions inside the container and then trying to use them in later operations. If you can find some less flexible interface that suits your needs, perhaps putting any complex operations into single locked function calls, then you can easily wrap a thread-safe class around the container to simplify your usage.
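A minimal sketch of such a less flexible, coarsely locked wrapper (names are illustrative; every complete operation holds the lock and no iterators ever escape):

#include <mutex>
#include <unordered_map>

template <typename Key, typename Value>
class SynchronizedMap
{
public:
    void set(const Key& key, Value value)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        map_[key] = std::move(value);
    }

    // The lookup and the copy-out happen under one lock, so there is no window
    // in which another thread can invalidate what we found.
    bool get(const Key& key, Value& out) const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = map_.find(key);
        if (it == map_.end())
            return false;
        out = it->second;
        return true;
    }

    bool erase(const Key& key)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return map_.erase(key) > 0;
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<Key, Value> map_;
};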
Are you sure that is what you need?
while (!stack.empty())
{
    Element const e = stack.top();
    stack.pop();
}
In a single thread, this code looks right. If you wish to go multi-thread however, simply having a synchronized stack just doesn't cut it.
What happens if someone else pops the last element AFTER you tested for emptiness?
There is more to going multi-threaded than container synchronization. That said, you could try TBB.
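The usual fix is to merge the emptiness test and the pop into one locked operation, in the style of the thread-safe containers from "C++ Concurrency in Action" (a sketch; names are illustrative):

#include <mutex>
#include <stack>

template <typename T>
class ThreadSafeStack
{
public:
    void push(T value)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        data_.push(std::move(value));
    }

    // Combines the empty check and the pop, so no other thread can slip in
    // between them; returns false instead of popping from an empty stack.
    bool try_pop(T& out)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (data_.empty())
            return false;
        out = std::move(data_.top());
        data_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::stack<T> data_;
};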
Use Folly's AtomicHashmap.
From Folly's documentation on GitHub:
folly/AtomicHashmap.h introduces a synchronized UnorderedAssociativeContainer implementation designed for extreme performance in heavily multithreaded environments (about 2-5x faster than tbb::concurrent_hash_map) and good memory usage properties. Find and iteration are wait-free, insert has key-level lock granularity, there is minimal memory overhead, and permanent 32-bit ids can be used to reference each element.
It comes with some limitations though.
Intel's Threading Building Blocks (TBB) library has a class tbb::concurrent_hash_map that is an unordered map allowing concurrent access. Internally it is implemented using a fine-grained locking scheme, but the basic outcome is that you can access it without race conditions.
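Basic usage looks roughly like this (a sketch based on the documented accessor interface; the key/value types are arbitrary):

#include <string>
#include <tbb/concurrent_hash_map.h>

typedef tbb::concurrent_hash_map<int, std::string> Table;

void example(Table& table)
{
    {
        // An accessor holds a write lock on this element until it goes out of scope.
        Table::accessor a;
        table.insert(a, 42);
        a->second = "hello";
    }
    {
        // A const_accessor holds a read lock; other readers of the same key are not blocked.
        Table::const_accessor ca;
        if (table.find(ca, 42))
        {
            // use ca->second here
        }
    }
    table.erase(42);
}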

Alternatives for locks for synchronisation

I'm currently in the process of developing my own little threading library, mainly for learning purposes, and am at the part about the message queue, which will involve a lot of synchronisation in various places. Previously I've mainly used locks, mutexes and, to some extent, condition variables, which are all variations on the same theme: a lock for a section that should only be used by one thread at a time.
Are there any solutions to synchronisation other than locks? I've read about lock-free synchronisation in places, but some consider hiding the locks in containers to be lock-free, which I disagree with - you just don't explicitly use the locks yourself.
Lock-free algorithms typically involve using compare-and-swap (CAS) or similar CPU instructions that update some value in memory not only atomically, but also conditionally and with an indicator of success. That way you can code something like this:
1 do
2 {
3     current_value = the_variable
4     new_value = ...some expression using current_value...
5 } while (!compare_and_swap(the_variable, current_value, new_value));
compare_and_swap() atomically checks whether the_variable's value is still current_value, and only if so does it update the_variable's value to new_value and return true.
The exact calling syntax varies with the CPU and may involve assembly language or system/compiler-provided wrapper functions (prefer the latter if available, as they restrict usage to behaviours that remain safe under compiler optimisations); in general, check your docs.
The significance is that when another thread updates the variable after the read on line 3 but before the CAS on line 5 attempts the update, the compare and swap instruction will fail because the state from which you're updating is not the one you used to calculate the desired target state. Such do/while loops can be said to "spin" rather than lock, as they go round and round the loop until CAS succeeds.
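In portable C++11 the same loop can be written with std::atomic and compare_exchange_weak (a sketch; the variable and the update expression are just examples):

#include <atomic>

std::atomic<int> the_variable(0);

void update()
{
    int current_value = the_variable.load();
    int new_value;
    do
    {
        new_value = current_value * 3;   // ...some expression using current_value...
        // On failure, compare_exchange_weak reloads current_value with the value
        // another thread stored, so the loop recomputes and tries again.
    } while (!the_variable.compare_exchange_weak(current_value, new_value));
}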
Crucially, your existing threading library can be expected to use a two-stage locking approach for mutexes, read-write locks etc., involving:
First stage: spinning using CAS or similar (i.e. spin on { read the current value; if it's not set, then cas(current = not set, new = set) }) - which means that when other threads are doing quick updates, your thread usually won't have to be swapped out to wait, with all the relatively time-consuming overheads that involves.
The second stage is only used if some limit of loop iterations or elapsed time is exceeded: it asks the operating system to queue the thread until it knows (or at least suspects) the lock is free to acquire.
The implication of this is that if you're using a mutex to protect access to a variable, then you are unlikely to do any better by implementing your own CAS-based "mutex" to protect the same variable.
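For illustration, the first (spinning) stage on its own is essentially a test-and-set spinlock; a minimal sketch, not something to prefer over your library's mutex:

#include <atomic>

class SpinLock
{
public:
    void lock()
    {
        // Spin until this thread is the one that flips the flag from clear to set.
        while (flag_.test_and_set(std::memory_order_acquire))
        {
            // A production mutex would give up after a while and ask the OS to
            // queue the thread instead of burning CPU here.
        }
    }

    void unlock()
    {
        flag_.clear(std::memory_order_release);
    }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};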
Lock free algorithms come into their own when you are working directly on a variable that's small enough to update directly with the CAS instruction itself. Instead of being...
get a mutex (by spinning on CAS, falling back on slower OS queue)
update variable
release mutex
...they're simplified (and made faster) by simply having the spin on CAS do the variable update directly. Of course, you may find the work to calculate new value from old painful to repeat speculatively, but unless there's a LOT of contention you're not wasting that effort often.
This ability to update only a single location in memory has far-reaching implications, and work-arounds can require some creativity. For example, if you had a container using lock-free algorithms, you may decide to calculate a potential change to an element in the container, but can't sync that with updating a size variable elsewhere in memory. You may need to live without size, or be able to use an approximate size where you do a CAS-spin to increment or decrement the size later, but any given read of size may be slightly wrong. You may need to merge two logically-related data structures - such as a free list and the element-container - to share an index, then bit-pack the core fields for each into the same atomically-sized word at the start of each record. These kinds of data optimisations can be very invasive, and sometimes won't get you the behavioural characteristics you'd like. Mutexes et al are much easier in this regard, and at least you know you won't need a rewrite to mutexes if requirements evolve just that step too far. That said, clever use of a lock-free approach really can be adequate for a lot of needs, and yield a very gratifying performance and scalability improvement.
A core (good) consequence of lock-free algorithms is that one thread can't be holding the mutex then happen to get swapped out by the scheduler, such that other threads can't work until it resumes; rather - with CAS - they can spin safely and efficiently without an OS fallback option.
Things that lock free algorithms can be good for include updating usage/reference counters, modifying pointers to cleanly switch the pointed-to data, free lists, linked lists, marking hash-table buckets used/unused, and load-balancing. Many others of course.
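The first item in that list, for example, needs nothing more than atomic increments and decrements (a trivial sketch of a usage counter kept as a statistic):

#include <atomic>

std::atomic<long> active_connections(0);

void on_connect()    { active_connections.fetch_add(1, std::memory_order_relaxed); }
void on_disconnect() { active_connections.fetch_sub(1, std::memory_order_relaxed); }

(A true reference count that triggers destruction on reaching zero needs stronger ordering on the decrement, but for a plain usage statistic relaxed operations are enough.)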
As you say, simply hiding use of mutexes behind some API is not lock free.
There are a lot of different approaches to synchronization. There are various variants of message-passing (for example, CSP) or transactional memory.
Both of these may be implemented using locks, but that's an implementation detail.
And then of course, for some purposes, there are lock-free algorithms or data-structures, which make do with just a few atomic instructions (such as compare-and-swap), but this isn't really a general-purpose replacement for locks.
There are several implementations of some data structures, which can be implemented in a lock free configuration. For example, the producer/consumer pattern can often be implemented using lock-free linked list structures.
However, most lock-free solutions require significant thought on the part of the person designing the specific program/specific problem domain. They aren't generally applicable for all problems. For examples of such implementations, take a look at Intel's Threading Building Blocks library.
Most important to note is that no lock-free solution is free. You are going to give something up to make it work - at the bare minimum in implementation complexity, and probably in performance in scenarios where you're running on a single core (for example, a linked list is MUCH slower than a vector). Make sure you benchmark before adopting a lock-free approach on the assumption that it will be faster.
Side note: I really hope you're not using condition variables, because there's no way to ensure that their access operates as you wish in C and C++.
Yet another library to add to your reading list: FastFlow
What's interesting in your case is that they are based on lock-free queues. They have implemented a simple lock-free queue and then have built more complex queues out of it.
And since the code is free, you can peruse it and get the code for the lock-free queue, which is far from trivial to get right.
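To give a flavour of what "far from trivial" means, even the simplest useful case - a bounded queue with exactly one producer thread and one consumer thread - already needs careful memory ordering. A minimal sketch of such a single-producer/single-consumer ring buffer (illustrative only; FastFlow's own queues are more sophisticated):

#include <array>
#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>
class SpscQueue
{
public:
    bool push(const T& value)                           // producer thread only
    {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false;                               // full
        buffer_[tail] = value;
        tail_.store(next, std::memory_order_release);   // publish the element
        return true;
    }

    bool pop(T& out)                                    // consumer thread only
    {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire))
            return false;                               // empty
        out = buffer_[head];
        head_.store((head + 1) % N, std::memory_order_release);  // release the slot
        return true;
    }

private:
    std::array<T, N> buffer_;
    std::atomic<std::size_t> head_{0};                  // next slot to read
    std::atomic<std::size_t> tail_{0};                  // next slot to write
};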