This question already has answers here:
ConcurrentHashMap for c++
(4 answers)
Closed 9 years ago.
I have a need to use a HashMap/ HashTable implementation in C++ and i have the following requirements
1- When new data is being inserted in the hashmap the complete hashmap is not locked, and other threads are allowed to read and also update other keys/ values in the hashmap
2- Multiple keys/ values should be updateable at the same time. i.e. one thread updating key x while the other thread updating key y.
Does such an implementation exist in C++ stl or any other libraries out there? or do i need to write something of my own?
I believe that both the Microsoft PPL implementation of concurrent_unordered_map and the Intel TBB implementation currently have a lock per hash bin. TBB also has a concurrent_hash_map with slightly different semantics. None of them guarantees any amount of concurrency in their spec. The only thing the specs guarantee is lack of data races.
If your algorithm is going to be this performance sensitive to the performance of concurrent hash table writes, you are probably in trouble, though. The cost of acquiring and releasing the locks on every access is similar to the cost of doing the hash-table insert. So you are going to lose half your performance to locking overhead, and need that much more parallelism to recover it. Are you sure you can't have a hash-table per thread, and then merge all the hash tables when you are done? (Some algorithms will let you get away with this, others not.)
Edit: I just noticed you are asking to be able to update keys concurrently. This is simply not possible under any concurrent hash table implementation that I'm aware of. The reason is that this is a read-modify-update operation, which, as #JoopEggen pointed out, just don't work with concurrent container types. In fact, this is a read-modify-update-modify-update operation. In order to modify keys you would need to make the whole sequence of operations atomic. Concurrent container types are monitors: each individual method call can be atomic, but sequences of them are not.
Related
I have big C++/STL data structures (myStructType) with imbricated lists and maps. I have many objects of this type I want to LRU-cache with a key. I can reload objects from disk when needed. Moreover, it has to be shared in a multiprocessing high performance application running on a BSD plateform.
I can see several solutions:
I can consider a life-time sorted list of pair<size_t lifeTime, myStructType v> plus a map to o(1) access the index of the desired object in the list from its key, I can use shm and mmap to store everything, and a lock to manage access (cf here).
I can use a redis server configured for LRU, and redesign my data structures to redis key/value and key/lists pairs.
I can use a redis server configured for LRU, and serialise my data structures (myStructType) to have a simple key/value to manage with redis.
There may be other solutions of course. How would you do that, or better, how have you successfully done that, keeping in mind high performance ?
In addition, I would like to avoid heavy dependencies like Boost.
I actually built caches (not only LRU) recently.
Options 2 and 3 are quite likely not faster than re-reading from disk. That's effectively no cache at all. Also, this would be a far heavier dependency than Boost.
Option 1 can be challenging. For instance, you suggest "a lock". That would be quite a contended lock, as it must protect each and every lifetime update, plus all LRU operations. Since your objects are already heavy, it may be worthwhile to have a unique lock per object. There are intermediate variants of this solution, where there is more than one lock, but also more than one object per lock. (You still need a key to protect the whole map, but that's for replacement only)
You can also consider if you really need strict LRU. That strategy assumes that the chances of an object being reused decreases over time. If that's not actually true, random replacement is just as good. You can also consider evicting more than one element at a time. One of the challenges is that when an element needs removing, it would be so from all threads, but it's sufficient if one thread removes it. That's why a batch removal helps: if a thread tries to take a lock for batch removal and it fails, it can continue under the assumption that the cache will have free space soon.
One quick win is to not update the LRU time of the last used element. It was already the newest, making it any newer won't help. This of course only has an effect if you often use that element quickly again, but (as noted above) otherwise you'd just use random eviction.
We are developing a network application based C/S, we find there are too many locks adding to std::map that the performance of server became poor.
I wonder if it is possible to implement a lock-free map, if yes, how? Is there any open source code there?
EDIT:
Actually we use the std::map to store sockets information, we did encapsulation based on the socket file description to include some other necessary information such as ip address, port, socket type, tcp or udp, etc.
To summary, we have a global map say it's
map<int fileDescriptor, socketInfor*> SocketsMap,
then every thread which is used to send data needs to access SocketsMap, and they have to add mutex before reading from SocketsMap or writing to SocketsMap, thus the concurrency level of the whole application would be greatly decreased because of so many locks addding to SocketsMap.
To avoid the concurrency level problem, we have two solutions: 1. store each socketInfor* separately 2. use some kind of lock-free map.
I would like to find some kind of lock-free map, because codes changes required by this solution are much less than that of solution 1.
Actually there's a way, although I haven't implemented it myself there's a paper on a lock free map using hazard pointers from eminent C++ expert Andrei Alexandrescu.
Yes, I have implemented a Lock-Free Unordered Map (docs) in C++ using the "Split-Ordered Lists" concept. It's an auto-expanding container and supports millions of elements on a 64-bit CAS without ABA issues. Performance-wise, it's a beast (see page 5). It's been extensively tested with millions of random ops.
HashMap would suit? Have a look at Intel Threading Building Blocks, they have an interesting concurrent map. I'm not sure it's lock-free, but hopefully you're interested in good multithreading performance, not particularly in lock-freeness. Also you can check CityHash lib
EDIT:
Actually TBB's hash map is not lock-free
I'm surprised nobody has mentioned it, but Click Cliff has implemented a wait-free hashmap in Java, which I believe could be ported to C++,
If you use C++11, you can have a look at AtomicHashMap of facebook/folly
You can implement the map using optimistic design or transactional memory.
This approach is especially effective if the chance of two operations concurrently addressing the map and one is changing the structure of it is relatively small - and you do not want the overhead of locking every time.
However, from time to time - collision will occur, and you will have to result it somehow (usually by rolling back to the last stable state and retrying the operations).
If your hardware support good enough atomic operations - this can be easily done with Compare And Swap (CAS) - where you change the reference alone (and whenever you change the map, you work on a copy of the map, and not the original, and set it as the primary only when you commit).
I am using unordered_map from Boost. Are there any synchronized version of unordered_map? This is because I have quite a large number of unordered_map and manually synchronizing it using lock would be very messy.
Thanks.
It's impossible to usefully encapsulate containers offering STL-like interfaces (which unordered_map also does) with automatic locking because there are race conditions associated with retrieving iterators and positions inside the string then trying to use them in later operations. If you can find some less flexible interface that suits your needs, perhaps putting any complex operations into single locked function calls, then you can easily wrap a thread-safe class around the container to simplify your usage.
Are you sure that is what you need ?
while (!stack.empty())
{
Element const e = stack.top();
stack.pop();
}
In a single thread, this code looks right. If you wish to go multi-thread however, simply having a synchronized stack just doesn't cut it.
What happens if anyone else pops the last element AFTER you tested for emptiness ?
There is more than container synchronization to go multi-thread. That said, you could try TBB out.
Use Folly's AtomicHashmap.
From Folly's documentation on Github
folly/AtomicHashmap.h introduces a synchronized UnorderedAssociativeContainer implementation designed for extreme performance in heavily multithreaded environments (about 2-5x faster than tbb::concurrent_hash_map) and good memory usage properties. Find and iteration are wait-free, insert has key-level lock granularity, there is minimal memory overhead, and permanent 32-bit ids can be used to reference each element.
It comes with some limitations though.
Intel's Thread Building Blocks library has a class tbb::concurrent_hash_map that is an unordered map, allowing concurrent access. Internally it is implemented using a fine-grained locking scheme, but the basic outcome is that you can access it without race conditions.
I'm currently in the process of developing my own little threading library, mainly for learning purposes, and am at the part of the message queue which will involve a lot of synchronisation in various places. Previously I've mainly used locks, mutexes and condition variables a bit which all are variations of the same theme, a lock for a section that should only be used by one thread at a time.
Are there any different solutions to synchronisation than using locks? I've read lock-free synchronization at places, but some consider hiding the locks in containers to be lock-free, which I disagree with. you just don't explicitly use the locks yourself.
Lock-free algorithms typically involve using compare-and-swap (CAS) or similar CPU instructions that update some value in memory not only atomically, but also conditionally and with an indicator of success. That way you can code something like this:
1 do
2 {
3 current_value = the_varibale
4 new_value = ...some expression using current_value...
5 } while(!compare_and_swap(the_variable, current_value, new_value));
compare_and_swap() atomically checks whether the_variable's value is still current_value, and only if that's so will it update the_variable's value to new_value and return true
exact calling syntax will vary with the CPU, and may involve assembly language or system/compiler-provided wrapper functions (use the latter if available - there may be other compiler optimisations or issues that their usage restricts to safe behaviours); generally, check your docs
The significance is that when another thread updates the variable after the read on line 3 but before the CAS on line 5 attempts the update, the compare and swap instruction will fail because the state from which you're updating is not the one you used to calculate the desired target state. Such do/while loops can be said to "spin" rather than lock, as they go round and round the loop until CAS succeeds.
Crucially, your existing threading library can be expected to have a two-stage locking approach for mutex, read-write locks etc. involving:
First stage: spinning using CAS or similar (i.e. spin on { read the current value, if it's not set then cas(current = not set, new = set) }) - which means other threads doing a quick update often won't result in your thread swapping out to wait, and all the relatively time-consuming overheads associated with that.
The second stage is only used if some limit of loop iterations or elapsed time is exceeded: it asks the operating system to queue the thread until it knows (or at least suspects) the lock is free to acquire.
The implication of this is that if you're using a mutex to protect access to a variable, then you are unlikely to do any better by implementing your own CAS-based "mutex" to protect the same variable.
Lock free algorithms come into their own when you are working directly on a variable that's small enough to update directly with the CAS instruction itself. Instead of being...
get a mutex (by spinning on CAS, falling back on slower OS queue)
update variable
release mutex
...they're simplified (and made faster) by simply having the spin on CAS do the variable update directly. Of course, you may find the work to calculate new value from old painful to repeat speculatively, but unless there's a LOT of contention you're not wasting that effort often.
This ability to update only a single location in memory has far-reaching implications, and work-arounds can require some creativity. For example, if you had a container using lock-free algorithms, you may decide to calculate a potential change to an element in the container, but can't sync that with updating a size variable elsewhere in memory. You may need to live without size, or be able to use an approximate size where you do a CAS-spin to increment or decrement the size later, but any given read of size may be slightly wrong. You may need to merge two logically-related data structures - such as a free list and the element-container - to share an index, then bit-pack the core fields for each into the same atomically-sized word at the start of each record. These kinds of data optimisations can be very invasive, and sometimes won't get you the behavioural characteristics you'd like. Mutexes et al are much easier in this regard, and at least you know you won't need a rewrite to mutexes if requirements evolve just that step too far. That said, clever use of a lock-free approach really can be adequate for a lot of needs, and yield a very gratifying performance and scalability improvement.
A core (good) consequence of lock-free algorithms is that one thread can't be holding the mutex then happen to get swapped out by the scheduler, such that other threads can't work until it resumes; rather - with CAS - they can spin safely and efficiently without an OS fallback option.
Things that lock free algorithms can be good for include updating usage/reference counters, modifying pointers to cleanly switch the pointed-to data, free lists, linked lists, marking hash-table buckets used/unused, and load-balancing. Many others of course.
As you say, simply hiding use of mutexes behind some API is not lock free.
There are a lot of different approaches to synchronization. There are various variants of message-passing (for example, CSP) or transactional memory.
Both of these may be implemented using locks, but that's an implementation detail.
And then of course, for some purposes, there are lock-free algorithms or data-structures, which make do with just a few atomic instructions (such as compare-and-swap), but this isn't really a general-purpose replacement for locks.
There are several implementations of some data structures, which can be implemented in a lock free configuration. For example, the producer/consumer pattern can often be implemented using lock-free linked list structures.
However, most lock-free solutions require significant thought on the part of the person designing the specific program/specific problem domain. They aren't generally applicable for all problems. For examples of such implementations, take a look at Intel's Threading Building Blocks library.
Most important to note is that no lock-free solution is free. You're going to give something up to make that work, at the bare minimum in implementation complexity, and probably performance in scenarios where you're running on a single core (for example, a linked list is MUCH slower than a vector). Make sure you benchmark before using lock free on the base assumption that it would be faster.
Side note: I really hope you're not using condition variables, because there's no way to ensure that their access operates as you wish in C and C++.
Yet another library to add to your reading list: Fast Flow
What's interesting in your case is that they are based on lock-free queues. They have implemented a simple lock-free queue and then have built more complex queues out of it.
And since the code is free, you can peruse it and get the code for the lock-free queue, which is far from trivial to get right.
I'm looking for a associative container of some sort that provides safe concurrent read & write access provided you're never simultaneously reading and writing the same element.
Basically I have this setup:
Thread 1: Create A, Write A to container, Send A over the network.
Thread 2: Receive response to A, Read A from container, do some processing.
I can guarantee that we only ever write A once, though we may receive multiple responses for A which will be processed serially. This also guarantees that we never read and write A at the same time, since we can only receive a response to A after sending it.
So basically I'm looking for a container where writing to an element doesn't mess with any other elements. For example, std::map (or any other tree-based implementation) does not satisfy this condition because its underlying implementation is a red-black tree, so any given write may rebalance the tree and blow up any concurrent read operations.
I think that std::hash_map or boost::unordered_set may work for this, just based on my assumption that a normal hash table implementation would satisfy my criteria, but I'm not positive and I can't find any documentation that would tell me. Has anybody else tried using these similarly?
The STL won't provide any solid guarantees about threads, since the C++ standard doesn't mention threads at all. I don't know about boost, but I'd be surprised if its containers made any concurrency guarantees.
What about concurrent_hash_map from TBB? I found this in this related SO question.
Common hash table implementation have rehashing when the number of stored element increase, so that's probably not an option excepted if you know that this doesn't happen.
I'd look at structures used for functional languages (for instance look at http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf) but note that those I'm currently thinking about depend on garbage collection.