Synchronized unordered_map in C++

I am using unordered_map from Boost. Is there a synchronized version of unordered_map? I ask because I have quite a large number of unordered_maps, and manually synchronizing them with locks would be very messy.
Thanks.

It's impossible to usefully wrap a container that offers an STL-like interface (as unordered_map does) with automatic locking, because there are race conditions associated with retrieving iterators and positions inside the container and then trying to use them in later operations. If you can find a less flexible interface that suits your needs, perhaps one that puts any complex operation into a single locked function call, then you can easily wrap a thread-safe class around the container to simplify your usage.
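For illustration, here is a minimal sketch of such a wrapper (the class name and method set are my own, not from any library): a single mutex guards every call, each compound operation happens inside one locked call, and no iterators or references ever escape the lock.

#include <mutex>
#include <unordered_map>

// Coarse-grained wrapper: every compound operation is one locked call,
// and no iterators ever escape the lock.
template <typename K, typename V>
class SynchronizedMap {
public:
    void insert_or_assign(const K& key, const V& value) {
        std::lock_guard<std::mutex> guard(mutex_);
        map_[key] = value;
    }

    // Returns a copy; returning a reference or iterator would race with writers.
    bool try_get(const K& key, V& out) const {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = map_.find(key);
        if (it == map_.end()) return false;
        out = it->second;
        return true;
    }

    bool erase(const K& key) {
        std::lock_guard<std::mutex> guard(mutex_);
        return map_.erase(key) > 0;
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<K, V> map_;
};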

Are you sure that is what you need?
while (!stack.empty())
{
    // Not atomic: another thread may pop between empty() and top().
    Element const e = stack.top();
    stack.pop();
}
In a single thread, this code looks right. If you wish to go multi-threaded, however, simply having a synchronized stack just doesn't cut it.
What happens if another thread pops the last element AFTER you tested for emptiness?
There is more to going multi-threaded than container synchronization. That said, you could try TBB out.
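The usual fix is to merge the emptiness test and the pop into one atomic operation, which is what concurrent containers such as TBB's concurrent_queue expose as try_pop. A minimal sketch of that interface over a locked stack (the names are illustrative):

#include <mutex>
#include <stack>

// The emptiness test and the pop happen under one lock,
// so no other thread can intervene between them.
template <typename T>
class ConcurrentStack {
public:
    void push(T value) {
        std::lock_guard<std::mutex> guard(mutex_);
        stack_.push(std::move(value));
    }

    bool try_pop(T& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (stack_.empty()) return false;
        out = std::move(stack_.top());
        stack_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::stack<T> stack_;
};

A consumer then writes: Element e; while (s.try_pop(e)) { /* process e */ } and the race disappears.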

Use Folly's AtomicHashMap.
From Folly's documentation on GitHub:
folly/AtomicHashMap.h introduces a synchronized UnorderedAssociativeContainer implementation designed for extreme performance in heavily multithreaded environments (about 2-5x faster than tbb::concurrent_hash_map) and good memory usage properties. Find and iteration are wait-free, insert has key-level lock granularity, there is minimal memory overhead, and permanent 32-bit ids can be used to reference each element.
It comes with some limitations though.
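For flavour, a minimal usage sketch based on the documented interface; the capacity is estimated up front, keys must be 32- or 64-bit integers, and a few key values are reserved as sentinels, so check the Config in the header before relying on the details:

#include <folly/AtomicHashMap.h>

int main() {
    // Capacity is estimated at construction; growth is limited, and a few
    // key values are reserved as sentinels (see AtomicHashMap's Config).
    folly::AtomicHashMap<int64_t, int64_t> map(4096);

    map.insert(std::make_pair(int64_t{1}, int64_t{100}));  // key-level lock granularity
    auto it = map.find(1);                                 // wait-free find
    if (it != map.end()) {
        int64_t value = it->second;
        (void)value;  // use the value
    }
    return 0;
}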

Intel's Threading Building Blocks library has a class tbb::concurrent_hash_map that is an unordered map allowing concurrent access. Internally it is implemented with a fine-grained locking scheme, but the basic outcome is that you can access it without race conditions.
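The important idiom is that element access goes through accessor objects, which hold a per-element lock for as long as they are alive. A minimal sketch based on the documented accessor API:

#include <string>
#include <tbb/concurrent_hash_map.h>

using Table = tbb::concurrent_hash_map<int, std::string>;

void example(Table& table) {
    {
        Table::accessor a;        // holds a write lock on the element
        table.insert(a, 42);      // inserts (or finds) key 42
        a->second = "hello";
    }                             // lock released when the accessor dies
    {
        Table::const_accessor ca; // holds a read lock on the element
        if (table.find(ca, 42)) {
            const std::string& value = ca->second;
            (void)value;          // use the value while the accessor is alive
        }
    }
}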

Related

How to LRU-cache numerous objects made of C++ STL heavy structures?

I have big C++/STL data structures (myStructType) with nested lists and maps. I have many objects of this type that I want to LRU-cache by key. I can reload objects from disk when needed. Moreover, the cache has to be shared in a multiprocessing, high-performance application running on a BSD platform.
I can see several solutions:
I can keep a lifetime-sorted list of pair<size_t lifeTime, myStructType v>, plus a map giving O(1) access from a key to the index of the desired object in the list; I can use shm and mmap to store everything, and a lock to manage access.
I can use a Redis server configured for LRU, and redesign my data structures as Redis key/value and key/list pairs.
I can use a Redis server configured for LRU, and serialise my data structures (myStructType) so there is a single key/value pair to manage with Redis.
There may be other solutions, of course. How would you do it, or better, how have you successfully done it, keeping high performance in mind?
In addition, I would like to avoid heavy dependencies like Boost.
I actually built caches (not only LRU) recently.
Options 2 and 3 are quite likely not faster than re-reading from disk. That's effectively no cache at all. Also, this would be a far heavier dependency than Boost.
Option 1 can be challenging. For instance, you suggest "a lock". That would be quite a contended lock, as it must protect every lifetime update plus all LRU operations. Since your objects are already heavy, it may be worthwhile to have a unique lock per object. There are intermediate variants of this solution, where there is more than one lock, but also more than one object per lock. (You still need a lock to protect the whole map, but that's for replacement only.)
You can also consider whether you really need strict LRU. That strategy assumes the chance of an object being reused decreases over time; if that's not actually true, random replacement is just as good. You can also consider evicting more than one element at a time. One of the challenges is that when an element needs removing, it needs removing from the point of view of all threads, yet it's sufficient for one thread to remove it. That's why batch removal helps: if a thread tries to take the lock for a batch removal and fails, it can continue under the assumption that the cache will have free space soon, as sketched below.
One quick win is not to update the LRU time of the most recently used element. It is already the newest; making it any newer won't help. This of course only has an effect if you often reuse that element quickly, but (as noted above) otherwise you'd just use random eviction.
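To make the batch-removal idea concrete, here is a minimal sketch (the cache interface is hypothetical): a thread that cannot take the eviction lock simply carries on, trusting whoever holds it to make room.

#include <mutex>

std::mutex eviction_mutex;

void maybe_evict(/* Cache& cache */) {
    std::unique_lock<std::mutex> guard(eviction_mutex, std::try_to_lock);
    if (!guard.owns_lock()) {
        return;  // another thread is already evicting a batch
    }
    // Evict several of the oldest entries in one go while holding the lock,
    // e.g. cache.evict_oldest(kBatchSize);  // hypothetical helper
}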

C++ HashMap with multi-threading support [duplicate]

This question already has answers here:
ConcurrentHashMap for c++
(4 answers)
Closed 9 years ago.
I need to use a HashMap/HashTable implementation in C++, and I have the following requirements:
1. When new data is being inserted into the hashmap, the complete hashmap is not locked, and other threads are allowed to read and also update other keys/values in the hashmap.
2. Multiple keys/values should be updateable at the same time, i.e. one thread updating key x while another thread updates key y.
Does such an implementation exist in the C++ STL or any other library out there? Or do I need to write something of my own?
I believe that both the Microsoft PPL implementation of concurrent_unordered_map and the Intel TBB implementation currently have a lock per hash bin. TBB also has a concurrent_hash_map with slightly different semantics. None of them guarantees any amount of concurrency in their spec. The only thing the specs guarantee is lack of data races.
If your algorithm is going to be this sensitive to the performance of concurrent hash-table writes, though, you are probably in trouble. The cost of acquiring and releasing the locks on every access is similar to the cost of the hash-table insert itself, so you are going to lose half your performance to locking overhead and need that much more parallelism to recover it. Are you sure you can't have a hash table per thread and then merge all the hash tables when you are done? (Some algorithms will let you get away with this, others not; see the sketch below.)
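A sketch of that per-thread approach, assuming the algorithm tolerates a simple single-threaded merge at the end:

#include <thread>
#include <unordered_map>
#include <vector>

// Each worker fills its own map with no locking at all; the partial maps
// are merged once every worker has joined.
std::unordered_map<int, int> parallel_build(int num_threads) {
    std::vector<std::unordered_map<int, int>> partials(num_threads);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([t, &partials] {
            partials[t][t] = t * t;  // placeholder for the real per-thread work
        });
    }
    for (auto& w : workers) w.join();

    std::unordered_map<int, int> merged;
    for (auto& p : partials) {
        merged.insert(p.begin(), p.end());  // single-threaded merge
    }
    return merged;
}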
Edit: I just noticed you are asking to be able to update keys concurrently. That is simply not possible under any concurrent hash-table implementation I'm aware of. The reason is that it is a read-modify-update operation, which, as @JoopEggen pointed out, just doesn't work with concurrent container types. In fact, it is a read-modify-update-modify-update operation: to modify a key you would need to make the whole sequence of operations atomic. Concurrent container types are monitors: each individual method call can be atomic, but sequences of them are not.

Is it possible to implement a lock-free map in C++

We are developing a client/server network application, and we find there are so many locks added around std::map that the performance of the server has become poor.
I wonder if it is possible to implement a lock-free map; if yes, how? Is there any open-source code out there?
EDIT:
Actually we use the std::map to store socket information; we did an encapsulation based on the socket file descriptor to include some other necessary information such as IP address, port, and socket type (TCP or UDP).
To summarize, we have a global map, say
map<int fileDescriptor, socketInfor*> SocketsMap,
and every thread used to send data needs to access SocketsMap, taking a mutex before reading from or writing to it. Thus the concurrency level of the whole application is greatly decreased by all the locking around SocketsMap.
To avoid this concurrency problem, we have two solutions: 1. store each socketInfor* separately; 2. use some kind of lock-free map.
I would like to find some kind of lock-free map, because the code changes required by this solution are much smaller than those required by solution 1.
Actually there's a way. Although I haven't implemented it myself, there's a paper on a lock-free map using hazard pointers by eminent C++ expert Andrei Alexandrescu.
Yes, I have implemented a Lock-Free Unordered Map (docs) in C++ using the "Split-Ordered Lists" concept. It's an auto-expanding container and supports millions of elements on a 64-bit CAS without ABA issues. Performance-wise, it's a beast (see page 5). It's been extensively tested with millions of random ops.
Would a hash map suit? Have a look at Intel Threading Building Blocks; they have an interesting concurrent map. I'm not sure it's lock-free, but hopefully you're interested in good multithreading performance rather than lock-freedom per se. You can also check the CityHash lib.
EDIT:
Actually, TBB's hash map is not lock-free.
I'm surprised nobody has mentioned it, but Cliff Click has implemented a wait-free hashmap in Java, which I believe could be ported to C++.
If you use C++11, you can have a look at the AtomicHashMap of facebook/folly.
You can implement the map using optimistic design or transactional memory.
This approach is especially effective if the chance of two operations concurrently addressing the map, with one changing its structure, is relatively small, and you do not want the overhead of locking on every access.
However, from time to time a collision will occur, and you will have to resolve it somehow (usually by rolling back to the last stable state and retrying the operation).
If your hardware supports good enough atomic operations, this can easily be done with Compare And Swap (CAS): you change only the reference (whenever you change the map, you work on a copy of the map, not the original, and set the copy as the primary only when you commit).
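A minimal sketch of that copy-on-write scheme, using the C++11 std::atomic_* free functions on shared_ptr (deprecated in C++20 in favour of std::atomic<std::shared_ptr>): readers take a snapshot and are never blocked; writers copy, modify, and publish with a CAS, retrying on collision.

#include <map>
#include <memory>

using Map = std::map<int, int>;
std::shared_ptr<const Map> g_map = std::make_shared<Map>();

int read_value(int key) {
    // Readers grab an immutable snapshot; no lock, no waiting.
    std::shared_ptr<const Map> snapshot = std::atomic_load(&g_map);
    auto it = snapshot->find(key);
    return it == snapshot->end() ? -1 : it->second;
}

void write_value(int key, int value) {
    std::shared_ptr<const Map> current = std::atomic_load(&g_map);
    for (;;) {
        auto copy = std::make_shared<Map>(*current);  // work on a copy
        (*copy)[key] = value;
        // Publish only if nobody committed in the meantime; on failure,
        // 'current' is refreshed and we retry with a fresh copy.
        if (std::atomic_compare_exchange_weak(
                &g_map, &current, std::shared_ptr<const Map>(std::move(copy)))) {
            break;
        }
    }
}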

Design Problem: Thread safety of std::map

I am using std::map to implement my local hash table, which will be accessed by multiple threads at the same time.
I did some research and found that std::map is not thread safe.
So I will use a mutex for insert and delete operations on the map.
I plan to have separate mutex(es), one for each map entry so that they can be modified independently.
Do I need to put the find operation under a critical section as well?
Will the find operation be affected by insert/delete operations?
Is there a better implementation than std::map that can take care of all this?
Binary trees are not particularly suited to multi-threading because rebalancing can degenerate into a tree-wide modification. Furthermore, a global mutex will very negatively affect performance.
I would strongly suggest using an already-written thread-safe container. For example, Intel TBB contains a concurrent_hash_map.
If you wish to learn, however, here are some hints on building a concurrent sorted associative container (a full introduction is, I believe, not only out of my reach but also out of place here).
Reader/Writer
Rather than a regular mutex, you may want to use a reader/writer mutex. This parallelizes reads, while writes remain strictly sequential.
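With C++17 this is std::shared_mutex (Boost has an equivalent shared_mutex for older compilers). A minimal sketch around a std::map:

#include <map>
#include <shared_mutex>
#include <string>

class GuardedMap {
public:
    bool find(int key, std::string& out) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);  // many concurrent readers
        auto it = map_.find(key);
        if (it == map_.end()) return false;
        out = it->second;
        return true;
    }

    void insert(int key, std::string value) {
        std::unique_lock<std::shared_mutex> lock(mutex_);  // writers are exclusive
        map_[key] = std::move(value);
    }

private:
    mutable std::shared_mutex mutex_;
    std::map<int, std::string> map_;
};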
Own Tree
You can also build your own red-black or AVL tree, augmenting the tree structure with a reader/writer mutex per node. This allows you to block only part of the tree rather than the whole structure, even when rebalancing; e.g. inserts with keys far enough apart can proceed in parallel.
Skip Lists
Linked lists are much more amenable to concurrent manipulations, because you can easily isolate the modified zone.
A Skip List builds on this strength, but augments the structure to provide O(log N) access by key.
The typical way to walk a list is the hand-over-hand idiom: you grab the mutex of the next node before releasing that of the current node. Skip lists add a second dimension, as you can dive between two nodes, thus releasing both of them (and letting other walkers go ahead of you).
Implementations are much simpler than for binary search trees.
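A minimal sketch of the hand-over-hand idiom on a plain singly linked list (the skip-list version layers extra levels on top of the same idea; 'head' is assumed to be a dummy sentinel node):

#include <memory>
#include <mutex>

struct Node {
    int value = 0;
    std::mutex m;
    std::unique_ptr<Node> next;
};

bool contains(Node* head, int target) {
    Node* node = head;
    std::unique_lock<std::mutex> lk(node->m);
    while (node->next) {
        std::unique_lock<std::mutex> next_lk(node->next->m);  // lock the successor first...
        node = node->next.get();
        lk.swap(next_lk);  // ...then release the predecessor (when next_lk dies)
        if (node->value == target) return true;
    }
    return false;
}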
Persistent
Another interesting piece is the idea of persistent (or semi-persistent) data structures, often found in functional programming. Binary search trees are particularly amenable to it.
The basic idea is to never change a node (or its content) once it exists. You do so by sharing a mutable head that points to the latest version.
To read: you copy the current head, then use it without worry (the information is immutable).
To write: each node that you would modify in a regular tree is instead copied, and the copy is modified; you therefore rebuild part of the tree (up to the root) each time and update the head to point to the new root. There are efficient ways to rebalance while descending the tree. Writes are sequential.
The main advantage is that a version of the map is always available. That is, you can always read even when another thread is performing an insert or delete. Furthermore, because a read only requires a single concurrent access (when copying the root pointer), reads are near lock-free and thus have excellent performance.
Reference counting (intrusive) is your friend for those nodes.
Note: copies of the tree are very cheap :)
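A minimal path-copying insert for a persistent BST (unbalanced, for brevity): only the nodes on the path from the root to the insertion point are copied; the rest of the tree is shared between versions.

#include <memory>

struct Node {
    int key;
    std::shared_ptr<const Node> left, right;
};

std::shared_ptr<const Node> insert(const std::shared_ptr<const Node>& root, int key) {
    if (!root) {
        return std::make_shared<const Node>(Node{key, nullptr, nullptr});
    }
    if (key < root->key) {  // copy this node, share the untouched right subtree
        return std::make_shared<const Node>(
            Node{root->key, insert(root->left, key), root->right});
    }
    if (key > root->key) {  // copy this node, share the untouched left subtree
        return std::make_shared<const Node>(
            Node{root->key, root->left, insert(root->right, key)});
    }
    return root;  // key already present: the old version is returned unchanged
}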
I do not know any implementation in C++ of either a concurrent Skip List or a concurrent Semi-Persistent Binary Search Tree.
You will indeed need to put find in a critical section, but you might want two different locks, one for writing and one for reading. The write lock is exclusive, but if no thread holds the write lock, several threads may read concurrently with no problems.
Such an implementation would work with most STL implementations, but it would not be standards-compliant. std::map is usually implemented as a red-black tree, which doesn't change when elements are read. If the map were implemented as a splay tree instead, the tree would change during lookup and only one thread could read at a time.
For most purposes I would recommend using two locks.
Yes, if the insert or delete results in a rebalance I believe that find could be affected too.
Yes - You would need to put insert, delete and find in a critical section. There are techniques to enable multiple finds at the same time.
From what I can see, a similar question has been answered here, and the answer includes the explanation for this question also, as well as a link explaining the thread safety in more details.
Thread safety of std::map for read-only operations

Alternatives for locks for synchronisation

I'm currently in the process of developing my own little threading library, mainly for learning purposes, and am at the part about the message queue, which will involve a lot of synchronisation in various places. Previously I've mainly used locks, mutexes, and condition variables, which are all variations on the same theme: a lock for a section that should only be used by one thread at a time.
Are there any solutions to synchronisation other than locks? I've read about lock-free synchronisation in places, but some consider hiding the locks in containers to be lock-free, which I disagree with; you just don't explicitly use the locks yourself.
Lock-free algorithms typically involve using compare-and-swap (CAS) or similar CPU instructions that update some value in memory not only atomically, but also conditionally and with an indicator of success. That way you can code something like this:
1 do
2 {
3     current_value = the_variable
4     new_value = ...some expression using current_value...
5 } while (!compare_and_swap(the_variable, current_value, new_value));
compare_and_swap() atomically checks whether the_variable's value is still current_value; only if so will it update the_variable's value to new_value and return true.
The exact calling syntax will vary with the CPU and may involve assembly language or system/compiler-provided wrapper functions (use the latter if available, as they restrict compiler optimisations and other issues to safe behaviours); generally, check your docs.
The significance is that when another thread updates the variable after your read on line 3 but before your CAS on line 5 attempts the update, the compare-and-swap instruction fails because the state from which you're updating is not the one you used to calculate the desired target state. Such do/while loops can be said to "spin" rather than lock, as they go round and round the loop until the CAS succeeds.
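In standard C++ (C++11 onward), the same loop is written with std::atomic; compare_exchange_weak refreshes the expected value on failure, so the loop simply recomputes and retries:

#include <atomic>

std::atomic<int> the_variable{0};

void update() {
    int current = the_variable.load();
    int desired;
    do {
        desired = current * 2 + 1;  // ...some expression using current...
    } while (!the_variable.compare_exchange_weak(current, desired));
    // On failure, compare_exchange_weak stored the fresh value in 'current'.
}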
Crucially, your existing threading library can be expected to have a two-stage locking approach for mutexes, read-write locks, etc., involving:
First stage: spinning using CAS or similar (i.e. spin on { read the current value; if it's not set, then cas(current = not set, new = set) }), which means other threads doing a quick update often won't cause your thread to be swapped out to wait, with all the relatively time-consuming overheads associated with that.
The second stage is only used if some limit of loop iterations or elapsed time is exceeded: it asks the operating system to queue the thread until it knows (or at least suspects) the lock is free to acquire.
The implication of this is that if you're using a mutex to protect access to a variable, then you are unlikely to do any better by implementing your own CAS-based "mutex" to protect the same variable.
Lock-free algorithms come into their own when you are working directly on a variable that's small enough to update with the CAS instruction itself. Instead of being...
get a mutex (by spinning on CAS, falling back on slower OS queue)
update variable
release mutex
...they're simplified (and made faster) by simply having the CAS spin do the variable update directly. Of course, you may find the work to calculate the new value from the old painful to repeat speculatively, but unless there's a LOT of contention you're not wasting that effort often.
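For example, maintaining a concurrent high-water mark needs no mutex at all; the new value is computed and retried directly on the atomic:

#include <atomic>

std::atomic<unsigned> high_water{0};

void record(unsigned sample) {
    unsigned seen = high_water.load();
    // Spin on CAS: recompute-and-retry replaces lock/update/unlock entirely.
    while (sample > seen &&
           !high_water.compare_exchange_weak(seen, sample)) {
        // 'seen' was refreshed by the failed CAS; the loop re-tests the condition.
    }
}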
This ability to update only a single location in memory has far-reaching implications, and work-arounds can require some creativity. For example, if you had a container using lock-free algorithms, you may decide to calculate a potential change to an element in the container, but can't sync that with updating a size variable elsewhere in memory. You may need to live without size, or be able to use an approximate size where you do a CAS-spin to increment or decrement the size later, but any given read of size may be slightly wrong. You may need to merge two logically-related data structures - such as a free list and the element-container - to share an index, then bit-pack the core fields for each into the same atomically-sized word at the start of each record. These kinds of data optimisations can be very invasive, and sometimes won't get you the behavioural characteristics you'd like. Mutexes et al are much easier in this regard, and at least you know you won't need a rewrite to mutexes if requirements evolve just that step too far. That said, clever use of a lock-free approach really can be adequate for a lot of needs, and yield a very gratifying performance and scalability improvement.
A core (good) consequence of lock-free algorithms is that one thread can't be holding a mutex when it happens to get swapped out by the scheduler, leaving other threads unable to work until it resumes; rather, with CAS, they can spin safely and efficiently without an OS fallback option.
Things that lock free algorithms can be good for include updating usage/reference counters, modifying pointers to cleanly switch the pointed-to data, free lists, linked lists, marking hash-table buckets used/unused, and load-balancing. Many others of course.
As you say, simply hiding use of mutexes behind some API is not lock free.
There are a lot of different approaches to synchronization. There are various variants of message-passing (for example, CSP) or transactional memory.
Both of these may be implemented using locks, but that's an implementation detail.
And then of course, for some purposes, there are lock-free algorithms or data-structures, which make do with just a few atomic instructions (such as compare-and-swap), but this isn't really a general-purpose replacement for locks.
Several data structures can be implemented in a lock-free fashion. For example, the producer/consumer pattern can often be implemented using lock-free linked-list structures.
However, most lock-free solutions require significant thought on the part of the person designing the specific program/specific problem domain. They aren't generally applicable for all problems. For examples of such implementations, take a look at Intel's Threading Building Blocks library.
Most important to note is that no lock-free solution is free. You're going to give something up to make it work, at the bare minimum in implementation complexity, and probably also performance in scenarios where you're running on a single core (for example, a linked list is MUCH slower than a vector). Make sure you benchmark before adopting lock-free code on the assumption that it will be faster.
Side note: I really hope you're not using condition variables, because there's no way to ensure that their access operates as you wish in C and C++.
Yet another library to add to your reading list: FastFlow.
What's interesting in your case is that it is based on lock-free queues: they have implemented a simple lock-free queue and then built more complex queues out of it.
And since the code is free, you can peruse it and get the code for the lock-free queue, which is far from trivial to get right.
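To give a flavour of what such a queue involves, here is a minimal single-producer/single-consumer ring buffer on atomic indices (a sketch only; FastFlow's real queues handle much more, e.g. unbounded growth and cache-line padding):

#include <atomic>
#include <cstddef>

// Bounded SPSC queue: the producer is the only writer of head_, the consumer
// the only writer of tail_, so acquire/release loads and stores suffice.
template <typename T, std::size_t N>
class SpscQueue {
public:
    bool try_push(const T& value) {
        std::size_t head = head_.load(std::memory_order_relaxed);
        std::size_t next = (head + 1) % N;
        if (next == tail_.load(std::memory_order_acquire)) return false;  // full
        buffer_[head] = value;
        head_.store(next, std::memory_order_release);  // publish to the consumer
        return true;
    }

    bool try_pop(T& out) {
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;  // empty
        out = buffer_[tail];
        tail_.store((tail + 1) % N, std::memory_order_release);
        return true;
    }

private:
    T buffer_[N];
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};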