I have an assignment that requires setting up a data structure with concurrent reads/writes (an order book for a matching engine in a trading exchange), and I have settled on concurrent linked/skip lists. I've looked at the following articles/reports, some of which are lock-free and many of which use fine-grained locking (listed below in no particular order):
Practical concurrent unrolled lists using lazy synchronisation
Practical lock-freedom page 53
A Contention-Friendly, Non-Blocking Skip List
A Provably Correct Concurrent Skip List
A Simple Optimistic Skiplist Algorithm
All of these have fairly detailed algorithm pseudocode listings, but there are two issues I note with all of them:
They are all maps: they associate some key with some value, whereas I need the Node class in all these algorithms to simply contain some struct T (more particularly, I need the number of units in that order, the unit selling/buying price, the order ID, and an insertion timestamp). MSDN has a very nice C# implementation of a skip list containing some T (albeit not concurrent, which is a strict requirement), and it is straightforward to adapt to C++ (ergo the tag).
They don't have update and get operations; what they do have are find (which returns a boolean, not the node value itself), insert, and delete operations. I am wondering if I can somehow compose and modify the latter three to create the former two, but I am a little lost.
How might I implement get/update so that concurrency and thread ordering is maintained?
I am leaning towards fine-grained locking (which is easier to reason about, even if slower) rather than lock-free algorithms (which I don't fully understand).
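For concreteness, here is a minimal sketch of the adaptation described above, in the style of the lazy-synchronisation lists: the node holds an Order struct (just the fields listed above) instead of a key/value pair, and get is composed from find's traversal plus a read under the per-node lock. All names are illustrative, and safe memory reclamation is deliberately ignored:

    #include <atomic>
    #include <cstdint>
    #include <mutex>

    // Hypothetical payload: the fields listed above.
    struct Order {
        std::uint64_t id;
        std::uint64_t units;
        double        price;      // unit buying/selling price
        std::uint64_t timestamp;  // insertion time
    };

    // Node in the style of the papers above, holding a struct T directly.
    struct Node {
        Order              order;
        std::atomic<Node*> next{nullptr};
        std::mutex         lock;           // per-node fine-grained lock
        std::atomic<bool>  marked{false};  // logical-deletion flag
    };

    // get: reuse the traversal from find, but instead of returning only a
    // boolean, lock the matching node, re-check that it hasn't been
    // logically deleted, and copy the payload out. update is identical
    // except that it writes the mutable fields under the same lock.
    bool get(Node* head, std::uint64_t id, Order& out) {
        for (Node* cur = head; cur; cur = cur->next.load()) {
            if (cur->order.id == id) {
                std::lock_guard<std::mutex> g(cur->lock);
                if (cur->marked.load()) return false;  // concurrently removed
                out = cur->order;                      // read under the node lock
                return true;
            }
        }
        return false;
    }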
Related
I'm reading up on data structures, especially immutable ones like the append-only B+ tree used in CouchDB and the hash array mapped trie (HAMT) used in Clojure and some other functional programming languages.
The main reason data structures that work well in memory might not work well on disk appears to be time spent on disk seeks due to fragmentation, as with a normal binary tree.
However, a HAMT is also very shallow, so it doesn't require any more seeks than a B tree.
Another suggested reason is that deletions from an array mapped trie are more expensive than from a B tree. This is based on the assumption that we're talking about a dense vector, and doesn't apply when using either as a hash map.
What's more, it seems that a B tree does more rebalancing, so using it in an append-only manner produces more garbage.
So why do CouchDB and practically every other database and filesystem use B trees?
[edit] fractal trees? log-structured merge tree? mind = blown
[edit] Real-life B trees use a degree in the thousands, while a HAMT has a degree of 32. A HAMT of degree 1024 would be possible, but slower due to popcnt handling 32 or 64 bits at a time.
B-trees are used because they are a well-understood algorithm that achieves "ideal" sorted-order read-cost. Because keys are sorted, moving to the next or previous key is very cheap.
HAMTs and other hash storage store keys in random order. Keys are retrieved by their exact value, and there is no efficient way to find the next or previous key.
Regarding degree, it is normally selected indirectly, by selecting page size. HAMTs are most often used in memory, with pages sized for cache lines, while B-trees are most often used with secondary storage, where page sizes are related to IO and VM parameters.
Log-Structured Merge (LSM) is a different approach to sorted-order storage that achieves better write efficiency by trading off some read efficiency. That hit to read efficiency can be a problem for read-modify-write workloads, but the fewer uncached reads there are, the better LSM's overall throughput is versus a B-tree - at the cost of higher worst-case read latency.
LSM also offers the promise of a wider-performance envelope. Putting new data into its proper place is "deferred", offering the possibility to tune read-to-write efficiency by controlling the proportion of deferred cleanup work to live work. In theory, an ideal-LSM with zero-deferral is a B-tree and with 100%-deferral is a log.
However, LSM is more of a "family" of algorithms than a specific algorithm like a B-tree. Their usage is growing in popularity, but it is hindered by the lack of a de-facto optimal LSM design. LevelDB/RocksDB is one of the more practical LSM implementations, but it is far from optimal.
Another approach to achieving write-throughput efficiency is to write-optimize B-trees through write-deferral, while attempting to maintain their optimal read-throughput.
Fractal-trees, shuttle-trees, stratified-trees are this type of design, and represent a hybrid gray area between B-tree and LSM. Rather than deferring writes to an offline process, they amortize write-deferral in a fixed way. For example, such a design might represent a fixed 60%-write-deferral fraction. This means they can't achieve the 100% write-deferral performance of an LSM, but they also have a more predictable read-performance, making them more practical drop-in replacements for B-trees. (As in the commercial Tokutek MySQL and MongoDB fractal-tree backends)
B-trees are ordered by their key, while in a hash map similar keys have very different hash values and so are stored far from each other. Now think of a query that does a range scan, "give me yesterday's sales": with a hash map you have to scan the whole map to find them; with a B-tree on the sales_dtm column you'll find them nicely clustered, and you know exactly where to start and stop reading.
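To make the range-scan point concrete, here is the miniature version using std::map (a red-black tree, so the same ordered access pattern a B-tree gives you); sales_by_dtm and the timestamps are made up for illustration:

    #include <cstdint>
    #include <map>
    #include <string>

    int main() {
        // Key: sale timestamp (sales_dtm); value: some sale record.
        std::map<std::uint64_t, std::string> sales_by_dtm = {
            {1000, "a"}, {2000, "b"}, {3000, "c"}, {4000, "d"}};

        // "Give me yesterday's sales": one descent to the start of the
        // range, then a sequential walk - no full scan as with a hash map.
        auto first = sales_by_dtm.lower_bound(2000);  // start of yesterday
        auto last  = sales_by_dtm.upper_bound(3000);  // end of yesterday
        for (auto it = first; it != last; ++it) {
            // process it->second ...
        }
    }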
I am using std::map to implement my local hash table, which will be accessed by multiple threads at the same time.
I did some research and found that std::map is not thread safe.
So I will use a mutex for insert and delete operations on the map.
I plan to have separate mutex(es), one for each map entry so that they can be modified independently.
Do I need to put find operation also under critical section?
Will find operation be affected by insert/delete operations?
Is there any better implementation than using std::map that can take care of everything?
Binary trees are not particularly suited to multi-threading because rebalancing can degenerate into a tree-wide modification. Furthermore, a global mutex will very negatively affect performance.
I would strongly suggest using an already-written thread-safe container. For example, Intel TBB contains a concurrent_hash_map.
If you wish to learn, however, here are some hints on building a concurrent sorted associative container (I believe a full introduction to be not only out of my reach but also out of place here).
Reader/Writer
Rather than a regular Mutex, you may want to use a Reader/Writer Mutex. This means parallelizing Reads, while Writes remain strictly sequential.
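A minimal sketch of this, assuming C++17's std::shared_mutex (boost::shared_mutex is the older equivalent) around a plain std::map:

    #include <map>
    #include <mutex>         // std::unique_lock
    #include <optional>
    #include <shared_mutex>  // std::shared_mutex, std::shared_lock

    class GuardedMap {
        std::map<int, int>        map_;
        mutable std::shared_mutex mtx_;
    public:
        std::optional<int> find(int key) const {
            std::shared_lock lock(mtx_);  // many readers may hold this at once
            auto it = map_.find(key);
            if (it == map_.end()) return std::nullopt;
            return it->second;
        }
        void insert(int key, int value) {
            std::unique_lock lock(mtx_);  // writes remain strictly sequential
            map_[key] = value;
        }
    };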
Own Tree
You can also build your own red-black or AVL tree, augmenting the tree structure with a Reader/Writer mutex per node. This allows you to block only part of the tree rather than the whole structure, even when rebalancing; e.g. inserts with keys far enough apart can proceed in parallel.
Skip Lists
Linked lists are much more amenable to concurrent manipulations, because you can easily isolate the modified zone.
A Skip List builds on this strength, but augments the structure to provide O(log N) access by key.
The typical way to walk a list is the hand-over-hand idiom: you grab the mutex of the next node before releasing the one of the current node. Skip lists add a 2nd dimension, as you can dive between two nodes, thus releasing both of them (and letting other walkers go ahead of you).
Implementations are much simpler than for binary search trees.
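A sketch of the hand-over-hand idiom on a plain sorted linked list (the skip-list version adds the ability to dive down a level and release both locks, as described above); the node layout is assumed:

    #include <mutex>

    struct Node {
        int        key;
        Node*      next;
        std::mutex lock;
    };

    // Hand-over-hand (lock coupling): acquire the next node's mutex
    // before releasing the current one, so no writer can slip in between.
    bool contains(Node* head, int key) {  // head is a sentinel node
        head->lock.lock();
        Node* prev = head;
        Node* cur  = head->next;
        while (cur) {
            cur->lock.lock();             // grab the next node...
            prev->lock.unlock();          // ...then release the current one
            if (cur->key == key) { cur->lock.unlock(); return true; }
            if (cur->key > key) break;    // sorted list: we passed the slot
            prev = cur;
            cur  = cur->next;
        }
        if (cur) cur->lock.unlock(); else prev->lock.unlock();
        return false;
    }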
Persistent
Another interesting approach is the idea of persistent (or semi-persistent) data structures, often found in functional programming. Binary search trees are particularly amenable to it.
The basic idea is to never change a node (or its content) once it exists. You do so by sharing a mutable head, which will point to the latest version.
To Read: you copy the current head, then use it without worry (the information is immutable).
To Write: each node that you would modify in a regular tree is instead copied, and the copy modified; you therefore rebuild part of the tree (up to the root) each time and update the head to point to the new root. There are efficient ways to rebalance while descending the tree. Writes are sequential.
The main advantage is that a version of the map is always available. That is, you can always read even when another thread is performing an insert or delete. Furthermore, because read access only requires a single concurrent read (copying the root pointer), reads are near lock-free, and thus have excellent performance.
Reference counting (intrinsic) is your friend for those nodes.
Note: copies of the tree are very cheap :)
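A sketch of the read and path-copying write described above, with std::shared_ptr providing the reference counting (rebalancing omitted):

    #include <memory>
    #include <mutex>

    struct Node {
        const int                   key;
        const std::shared_ptr<Node> left, right;  // immutable once built
    };

    std::shared_ptr<Node> head;        // mutable head: points to the latest root
    std::mutex            write_lock;  // writes are sequential

    // Read: copy the head once, then walk the snapshot without any locks.
    bool contains(int key) {
        auto n = std::atomic_load(&head);  // the single concurrent read
        while (n && n->key != key)
            n = key < n->key ? n->left : n->right;
        return n != nullptr;
    }

    // Write: copy every node on the path down to the change, sharing all
    // untouched subtrees, then publish the new root through the head.
    std::shared_ptr<Node> inserted(const std::shared_ptr<Node>& n, int key) {
        if (!n) return std::make_shared<Node>(Node{key, nullptr, nullptr});
        if (key < n->key)
            return std::make_shared<Node>(Node{n->key, inserted(n->left, key), n->right});
        if (key > n->key)
            return std::make_shared<Node>(Node{n->key, n->left, inserted(n->right, key)});
        return n;  // already present: share the whole subtree
    }

    void insert(int key) {
        std::lock_guard<std::mutex> g(write_lock);
        std::atomic_store(&head, inserted(std::atomic_load(&head), key));
    }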
I do not know any implementation in C++ of either a concurrent Skip List or a concurrent Semi-Persistent Binary Search Tree.
You will indeed need to put find in a critical section, but you might want two different locks: one for writing and one for reading. The write lock is exclusive, but if no thread holds the write lock, several threads may read concurrently with no problems.
Such an implementation would work with most STL implementations, but it would not be standards-compliant. std::map is usually implemented as a red-black tree, which doesn't change when elements are read. If the map were implemented using a splay tree instead, the tree would change during lookup and only one thread could read at a time.
For most purposes I would recommend using two locks.
Yes, if the insert or delete results in a rebalance I believe that find could be affected too.
Yes - You would need to put insert, delete and find in a critical section. There are techniques to enable multiple finds at the same time.
From what I can see, a similar question has been answered here, and the answer includes the explanation for this question as well as a link explaining thread safety in more detail.
Thread safety of std::map for read-only operations
This is just for a kind of concurrency refresher...
Imagine I have a B+ tree data structure in memory - multiple items per node, only leaf nodes contain items, leaf nodes also form a linked list for easy sequential access. Inserts and deletes mostly only affect a leaf node, but can cause nodes to split or merge in a process that may propagate to the root.
I have a single-thread implementation, and the updates follow a kind of pre-planning approach. A recursion steps up the tree from leaf level as far as nodes need to change, building a linked list (linking local variables in different recursion frames) that describes the changes needed. When it knows what's needed, it can check whether it can allocate all the needed nodes, and apply all the needed changes (or not) by referencing this plan before falling out of the recursion.
This implementation also "maintains" iterators on updates, so iterators aren't invalidated by inserts/deletes unless the specific item they point to is deleted. Inserts/deletes within the same node cause the iterators pointing into that node to be updated.
Trouble is, I need to make it multithreaded - supporting potentially many readers and writers at once.
I want multiple threads to be able to read and write at the same time, so long as there is no risk of corruption as a result. So for reading, I don't want mutually exclusive access at all, even to a single node. For writing, I want to lock the minimum number of nodes needed for the change. And I want to avoid deadlock, of course.
Thankfully, it isn't something I actually need to do - but since I've neglected my concurrency skills, this just seems like a good thought experiment.
This is obviously similar to the kinds of problems that databases and filesystems have to handle, so I'm guessing I might get some references to that kind of thing, which would be great.
So - how would I handle the thread synchronisation for this? I can vaguely see a role for mutexes and/or semaphores on nodes, but what strategies would I use to work with them?
Definitely a challenging task! I see that you're a C++ programmer; however, I believe C++ has concepts similar to Java's, and I'll try to help from a Java standpoint.
So for reading, I don't want mutually exclusive access at all, even to a single node
You could use a ReadWriteLock. It can be held simultaneously by multiple reader threads, so long as there are no writers; the write lock is exclusive. You just have to take exclusive access when writing. Do you have an analogue in C++?
And I want to avoid deadlock, of course.
Just lock multiple nodes in order of levels (e.g. from top to bottom). That will guarantee protection from deadlocks (somewhat similar in spirit to Lamport's Bakery Algorithm).
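A sketch of that ordering discipline in C++ (node layout assumed); because every writer acquires locks in the same global top-down order, no cycle of waiting threads can form:

    #include <mutex>
    #include <vector>

    struct Node {
        int        level;  // root = 0, increasing towards the leaves
        std::mutex lock;
        // ... keys, children ...
    };

    // path is assumed to be collected root-first, i.e. ascending by level.
    void lock_path_top_down(const std::vector<Node*>& path) {
        for (Node* n : path) n->lock.lock();
    }

    void unlock_path(const std::vector<Node*>& path) {
        for (auto it = path.rbegin(); it != path.rend(); ++it)
            (*it)->lock.unlock();
    }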
As for databases - they resolve deadlocks by killing one process :-).
One more strategy is to implement a non-blocking tree structure in a manner similar to how Cliff Click implemented his non-blocking hash map (a state machine with all cases covered):
video
Cheers
I'm currently in the process of developing my own little threading library, mainly for learning purposes, and am at the part of the message queue which will involve a lot of synchronisation in various places. Previously I've mainly used locks, mutexes and condition variables a bit which all are variations of the same theme, a lock for a section that should only be used by one thread at a time.
Are there any solutions to synchronisation other than using locks? I've read about lock-free synchronisation here and there, but some consider hiding the locks in containers to be lock-free, which I disagree with; you just don't explicitly use the locks yourself.
Lock-free algorithms typically involve using compare-and-swap (CAS) or similar CPU instructions that update some value in memory not only atomically, but also conditionally and with an indicator of success. That way you can code something like this:
1 do
2 {
3 current_value = the_variable
4 new_value = ...some expression using current_value...
5 } while(!compare_and_swap(the_variable, current_value, new_value));
compare_and_swap() atomically checks whether the_variable's value is still current_value, and only if that's so will it update the_variable's value to new_value and return true.
The exact calling syntax will vary with the CPU, and may involve assembly language or system-/compiler-provided wrapper functions (use the latter if available, as they constrain compiler optimisations and other issues to safe behaviours); generally, check your docs.
The significance is that when another thread updates the variable after the read on line 3 but before the CAS on line 5 attempts the update, the compare and swap instruction will fail because the state from which you're updating is not the one you used to calculate the desired target state. Such do/while loops can be said to "spin" rather than lock, as they go round and round the loop until CAS succeeds.
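In C++11 and later, the compiler-provided wrapper is std::atomic, and the pseudocode above maps onto compare_exchange_weak like this (doubling a counter is just a stand-in for "some expression"):

    #include <atomic>

    std::atomic<int> the_variable{1};

    void update() {
        int current_value = the_variable.load();
        int new_value;
        do {
            new_value = current_value * 2;  // ...some expression using current_value...
            // On failure, compare_exchange_weak reloads the fresh value into
            // current_value, so the loop recomputes and "spins" until it wins.
        } while (!the_variable.compare_exchange_weak(current_value, new_value));
    }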
Crucially, your existing threading library can be expected to have a two-stage locking approach for mutex, read-write locks etc. involving:
First stage: spinning using CAS or similar (i.e. spin on { read the current value; if it's not set, then cas(current = not set, new = set) }), which means that when other threads only hold the lock briefly, your thread often won't be swapped out to wait, avoiding all the relatively time-consuming overheads associated with that.
The second stage is only used if some limit of loop iterations or elapsed time is exceeded: it asks the operating system to queue the thread until it knows (or at least suspects) the lock is free to acquire.
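A toy sketch of that two-stage shape, with std::atomic_flag for the fast path and std::this_thread::yield standing in for the real OS wait queue (a production mutex would park the thread, e.g. via futex):

    #include <atomic>
    #include <thread>

    class TwoStageLock {
        std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            // Stage 1: spin a bounded number of times, hoping the holder
            // finishes quickly and we avoid a trip through the scheduler.
            for (int i = 0; i < 4000; ++i)
                if (!flag_.test_and_set(std::memory_order_acquire)) return;
            // Stage 2: stop burning CPU; yield until the lock is free.
            while (flag_.test_and_set(std::memory_order_acquire))
                std::this_thread::yield();
        }
        void unlock() { flag_.clear(std::memory_order_release); }
    };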
The implication of this is that if you're using a mutex to protect access to a variable, then you are unlikely to do any better by implementing your own CAS-based "mutex" to protect the same variable.
Lock-free algorithms come into their own when you are working directly on a variable that's small enough to update with the CAS instruction itself. Instead of being...
get a mutex (by spinning on CAS, falling back on slower OS queue)
update variable
release mutex
...they're simplified (and made faster) by simply having the spin on CAS do the variable update directly. Of course, you may find the work to calculate new value from old painful to repeat speculatively, but unless there's a LOT of contention you're not wasting that effort often.
This ability to update only a single location in memory has far-reaching implications, and work-arounds can require some creativity. For example, if you had a container using lock-free algorithms, you may decide to calculate a potential change to an element in the container, but can't sync that with updating a size variable elsewhere in memory. You may need to live without size, or be able to use an approximate size where you do a CAS-spin to increment or decrement the size later, but any given read of size may be slightly wrong. You may need to merge two logically-related data structures - such as a free list and the element-container - to share an index, then bit-pack the core fields for each into the same atomically-sized word at the start of each record. These kinds of data optimisations can be very invasive, and sometimes won't get you the behavioural characteristics you'd like. Mutexes et al are much easier in this regard, and at least you know you won't need a rewrite to mutexes if requirements evolve just that step too far. That said, clever use of a lock-free approach really can be adequate for a lot of needs, and yield a very gratifying performance and scalability improvement.
A core (good) consequence of lock-free algorithms is that one thread can't be holding the mutex then happen to get swapped out by the scheduler, such that other threads can't work until it resumes; rather - with CAS - they can spin safely and efficiently without an OS fallback option.
Things that lock-free algorithms can be good for include updating usage/reference counters, modifying pointers to cleanly switch the pointed-to data, free lists, linked lists, marking hash-table buckets used/unused, and load-balancing. Many others, of course.
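The reference-counter case from that list needs no CAS loop at all; a single atomic add or subtract is enough:

    #include <atomic>

    struct RefCounted {
        std::atomic<int> refs{1};
    };

    void acquire(RefCounted* p) {
        p->refs.fetch_add(1, std::memory_order_relaxed);  // lock-free increment
    }

    bool release(RefCounted* p) {
        // acq_rel so the thread that drops the last reference sees all
        // prior writes before it destroys the object.
        return p->refs.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }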
As you say, simply hiding use of mutexes behind some API is not lock free.
There are a lot of different approaches to synchronization. There are various variants of message-passing (for example, CSP) or transactional memory.
Both of these may be implemented using locks, but that's an implementation detail.
And then of course, for some purposes, there are lock-free algorithms or data-structures, which make do with just a few atomic instructions (such as compare-and-swap), but this isn't really a general-purpose replacement for locks.
Several data structures can be implemented in a lock-free fashion. For example, the producer/consumer pattern can often be implemented using lock-free linked-list structures.
However, most lock-free solutions require significant thought on the part of the person designing the specific program/specific problem domain. They aren't generally applicable for all problems. For examples of such implementations, take a look at Intel's Threading Building Blocks library.
Most important to note is that no lock-free solution is free. You're going to give something up to make it work: at the bare minimum, implementation complexity, and probably performance in scenarios where you're running on a single core (for example, a linked list is MUCH slower than a vector). Make sure you benchmark before adopting lock-free on the assumption that it would be faster.
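As an illustration of that complexity trade-off: the single-producer/single-consumer case is one of the few with a short, genuinely lock-free implementation; anything multi-producer is considerably harder. A sketch, assuming exactly one producer thread and one consumer thread:

    #include <atomic>
    #include <cstddef>
    #include <optional>

    template <typename T, std::size_t N>  // N must be a power of two
    class SpscQueue {
        T                        buf_[N];
        std::atomic<std::size_t> head_{0};  // advanced by the consumer
        std::atomic<std::size_t> tail_{0};  // advanced by the producer
    public:
        bool push(const T& v) {  // producer thread only
            auto t = tail_.load(std::memory_order_relaxed);
            if (t - head_.load(std::memory_order_acquire) == N) return false;  // full
            buf_[t % N] = v;
            tail_.store(t + 1, std::memory_order_release);  // publish the slot
            return true;
        }
        std::optional<T> pop() {  // consumer thread only
            auto h = head_.load(std::memory_order_relaxed);
            if (h == tail_.load(std::memory_order_acquire)) return std::nullopt;  // empty
            T v = buf_[h % N];
            head_.store(h + 1, std::memory_order_release);  // free the slot
            return v;
        }
    };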
Side note: I really hope you're not using condition variables, because there's no way to ensure that their access operates as you wish in C and C++.
Yet another library to add to your reading list: Fast Flow
What's interesting in your case is that they are based on lock-free queues. They have implemented a simple lock-free queue and then have built more complex queues out of it.
And since the code is free, you can peruse it and get the code for the lock-free queue, which is far from trivial to get right.
I'm looking for an associative container of some sort that provides safe concurrent read & write access, provided you're never simultaneously reading and writing the same element.
Basically I have this setup:
Thread 1: Create A, Write A to container, Send A over the network.
Thread 2: Receive response to A, Read A from container, do some processing.
I can guarantee that we only ever write A once, though we may receive multiple responses for A which will be processed serially. This also guarantees that we never read and write A at the same time, since we can only receive a response to A after sending it.
So basically I'm looking for a container where writing to an element doesn't mess with any other elements. For example, std::map (or any other tree-based implementation) does not satisfy this condition because its underlying implementation is a red-black tree, so any given write may rebalance the tree and blow up any concurrent read operations.
I think that std::hash_map or boost::unordered_set may work for this, just based on my assumption that a normal hash table implementation would satisfy my criteria, but I'm not positive and I can't find any documentation that would tell me. Has anybody else tried using these similarly?
The STL won't provide any solid guarantees about threads, since the C++ standard doesn't mention threads at all. I don't know about boost, but I'd be surprised if its containers made any concurrency guarantees.
What about concurrent_hash_map from TBB? I found this in this related SO question.
Common hash table implementations rehash when the number of stored elements increases, so that's probably not an option unless you know this doesn't happen.
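If rehashing is the only obstacle, one workaround is a fixed-bucket-count table with one lock per bucket, so a write only ever touches its own bucket and elements never move. A sketch (names and sizing are illustrative):

    #include <cstddef>
    #include <functional>
    #include <list>
    #include <mutex>
    #include <utility>

    template <typename K, typename V, std::size_t Buckets = 1024>
    class FixedHashMap {
        struct Bucket {
            std::mutex                 lock;
            std::list<std::pair<K, V>> items;
        };
        Bucket  buckets_[Buckets];  // never rehashes, never moves elements
        Bucket& bucket_for(const K& k) { return buckets_[std::hash<K>{}(k) % Buckets]; }
    public:
        void insert(const K& k, const V& v) {
            Bucket& b = bucket_for(k);
            std::lock_guard<std::mutex> g(b.lock);
            b.items.emplace_back(k, v);
        }
        bool find(const K& k, V& out) {
            Bucket& b = bucket_for(k);
            std::lock_guard<std::mutex> g(b.lock);
            for (auto& kv : b.items)
                if (kv.first == k) { out = kv.second; return true; }
            return false;
        }
    };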
I'd look at structures used in functional languages (for instance, see http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf), but note that those I'm currently thinking of depend on garbage collection.